Preface


Neyman (1962) referred to the empirical Bayes approach as a 
breakthrough in the theory of statistical decision making, and in the 
time since the publication of the first edition of Empirical Bayes 
Methods there have certainly been many contributions to this theory. 
A measure of the importance of a theory is its impact on the practice 
of Statistics. The empirical Bayes approach has not revolutionized the 
practice of Statistics, but there can be little argument that it has had a 
telling influence on the thinking of many statisticians, and on their 
practice in certain areas of application. One of the objects in 
preparing this new edition was to collect and present several practical 
examples of the application of empirical Bayes ideas and techniques 
so as to give an indication of the sorts of problems in which they may 
be useful. It is worth pointing out that meta analysis is now regarded 
as a highly desirable undertaking, especially in the social sciences and 
in medicine. It has clear connections with empirical Bayes. 

The empirical Bayes approach can be thought of as a way of 
looking at data arising in a sequence of similar experiments. It has 
competitors, and the relationships between them have received a good
deal of attention in recent publications by many authors. A discussion 
of alternatives to empirical Bayes is given in Chapter 7. 

Some topics in empirical Bayes theory have been given scant 
attention in this book, notably the question of rates of convergence of 
risks of certain methods. This is not to deny their theoretical interest. 
We have concentrated on topics which appear to have more 
immediate practical relevance. For example, an examination of 
applications suggests that linear Bayes and empirical Bayes methods 
are important. Various studies, including those of robustness by 
several authors, indicate that the use of parametric priors is more 
readily defensible than might have been suggested in the first edition 
of this book. 


In summary, the main changes to Empirical Bayes Methods
are: inclusion of more details of published accounts of applications,
more emphasis on linear EB methods, an account of some competitors
of EB, a chapter on interval estimation, and more material on
multiparameter problems.



Notation and Abbreviations 


$A_n \xrightarrow{P} A$: $A_n$ tends to $A$ in probability

$X \overset{d}{=} F$: the distribution of random variable $X$ is $F$

$X \overset{d}{=} N(\mu, \sigma^2)$: the distribution of $X$ is normal with mean $\mu$ and
variance $\sigma^2$

$X \overset{d}{=} \mathrm{Bin}(n, \theta)$: the distribution of $X$ is binomial with index $n$ and
probability parameter $\theta$

m.g.f.: moment generating function 

c.f.: characteristic function 

c.d.f.: cumulative distribution function 

p.d.f.: probability density function 

r.v.: random variable 

p.d.: probability distribution 

EB: empirical Bayes 

a.o.: asymptotically optimal

FB: full Bayesian 

CD: compound decision 
CHAPTER 1


Introduction to Bayes and empirical 
Bayes methods 


1.1 The problem, Bayes conventional and empirical Bayes methods 

Empirical Bayes (EB) and related techniques come into play when 
data are generated by repeated execution of the same type of random 
experiment. The individual experiments are often called component 
experiments, or simply components. It is convenient, when considering
the data from a particular component, to think of it as the current
component which has been preceded by the other components.
Empirical Bayes methods provide a way in which such historical 
data can be used in the assessment of the current results. This 
temporal view of the data sequence is a convenience and does not 
play an active role in EB analysis. 

An early example of an EB nature is given by von Mises (1943); see
also Chapter 8, section 8.3.1. In examining the quality of a batch of
water for possible bacterial contamination, $m = 5$ samples of a given
volume are taken. A sample registers a positive result if it contains at
least one bacterium. Interest centres on the probability, $\theta$, of a
positive result. Typically there are many repetitions of this experiment
with different batches, and the probability $\theta$ can be regarded as
varying randomly between experiments according to a prior distribution
$G(\theta)$. For a given $\theta$ the probability of $x$ positive results in
$m = 5$ samples is

$$ p(x|\theta) = \binom{5}{x} \theta^x (1-\theta)^{5-x}, $$

and in repetitions of the same procedure with different batches the
marginal distribution of the number of positive results in five
samples is the mixed binomial distribution

$$ p_G(x) = \int \binom{5}{x} \theta^x (1-\theta)^{5-x} \, dG(\theta). $$


If the distribution $G$ is known, a Bayesian analysis of the current
experiment can be performed. For example, a Bayes point estimate
of $\theta$ can be calculated as a possible competitor for the classical
maximum likelihood estimate. When the prior, or mixing, distribution
$G$ is not known it is possible to estimate it by using the
observed marginal distribution of the $x$ values. The essence of the EB
method in this case is that all calculations of a Bayes nature are
performed after replacing $G$ by its estimate. In the example discussed
by von Mises there are $N = 3420$ observations from the marginal
population characterized by $p_G(x)$, and particular attention is paid to
the estimation of $G$.
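The mixed binomial calculation can be sketched numerically. The following is a minimal illustration only: it assumes, purely for convenience, that the prior on the probability of a positive result is a Beta(a, b) distribution (the text leaves the prior unspecified, and the von Mises data are not reproduced here). The function names and the prior parameters are hypothetical. The sketch evaluates the marginal probabilities by numerical integration and the Bayes point estimate as the posterior mean.

```python
import math

M = 5  # samples per batch, as in the von Mises example


def beta_pdf(theta, a, b):
    """Density of the assumed Beta(a, b) prior (an illustrative choice)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta)
                    + (b - 1) * math.log(1 - theta))


def binom_pmf(x, theta, m=M):
    """p(x | theta): probability of x positive results in m samples."""
    return math.comb(m, x) * theta**x * (1 - theta) ** (m - x)


def integrate(func, n_grid=5000):
    """Midpoint rule on (0, 1); adequate for the smooth integrands used here."""
    h = 1.0 / n_grid
    return h * sum(func((i + 0.5) * h) for i in range(n_grid))


def marginal_pmf(x, a, b):
    """Mixed binomial probability: p(x | theta) averaged over the prior."""
    return integrate(lambda t: binom_pmf(x, t) * beta_pdf(t, a, b))


def bayes_estimate(x, a, b):
    """Posterior mean of theta given x, as a ratio of two integrals."""
    num = integrate(lambda t: t * binom_pmf(x, t) * beta_pdf(t, a, b))
    return num / marginal_pmf(x, a, b)


if __name__ == "__main__":
    a, b = 2.0, 3.0  # hypothetical prior parameters
    for x in range(M + 1):
        # columns: x, marginal probability, Bayes estimate, ML estimate x/m
        print(x, round(marginal_pmf(x, a, b), 4),
              round(bayes_estimate(x, a, b), 4), x / M)
```

In an EB analysis the prior parameters would not be given but estimated from the observed frequencies of the x values; the sketch shows only the Bayes step that such an estimate would feed.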

Much of the work in EB methods over the past two decades or so 
has been stimulated by Robbins in papers beginning with the 
reference Robbins (1955) where the terminology ‘empirical Bayes’ 
was introduced. However, as the examples in Chapter 8 show, it has 
become clear that the applicability of EB ideas is much wider than 
might have been suggested by the earlier writings. It may be noted 
especially that EB ideas are applicable in many problems involving 
mixtures of distributions. 

Generally, and more formally, we shall be concerned with problems
arising in the following manner: an observation $x$ is made on a
random variable $X$ whose distribution depends on the parameter $\lambda$.
Our task is to make a decision $\delta(x)$ about the value of $\lambda$. Typically the
decision may be the calculation of a point estimate of $\lambda$, or it may be
a choice between two hypothetical values of $\lambda$. The dependence of the
decision on $x$ is indicated by using the symbol $\delta(x)$, which is said to
represent the decision function. In practice one commonly has a
number $m > 1$ of independent observations on $X$, rather than just the
one value $x$, and obviously our theory has to allow for such multiple
observations. We may also have to deal with problems involving
more than one parameter. But to begin we shall avoid the notational,
computational and other complications that arise with these
generalizations.

In the pure Bayesian approach to the decision problem the
parameter value itself is regarded as a realization of a random
variable $\Lambda$ with distribution function $G(\lambda)$. The distribution of $\Lambda$
is called the prior distribution. The probabilities defined by $G(\lambda)$
are not necessarily interpretable in terms of relative frequencies. A
fundamental problem in the pure Bayes approach is the specification
of $G$. The Bayes solution of a decision problem generally depends on
$G$, and the Bayes decision function is denoted by $\delta_G(x)$ to show this
dependence. An outline of the pure Bayes approach is given in
section 1.2, with applications to examples which are used repeatedly
later on.

Decision functions derived without appeal to the notion of a prior 
distribution will be described as conventional, non-Bayes or
classical. A great classical literature exists. The various criteria for
obtaining non-Bayes decision rules, and related special techniques, 
are well documented and will be assumed known. It will be seen that 
rules derived by the likelihood principle are prominent among the 
non-Bayes rules. 

In the empirical Bayes approach the existence of a prior distribution
is postulated, but it is taken to be susceptible to a frequency
interpretation. Further, the availability of previous data, suitable for
estimation of the prior distribution $G$, is assumed. The mathematical
derivations associated with the Bayes method are used to obtain a
decision function $\delta_G(x)$, generally dependent on $G$, but then $\delta_G(x)$ is
replaced by an estimate based on the previous data. Such an
estimated $\delta_G(x)$ is called an empirical Bayes decision rule.


1.2 An introduction to Bayes techniques 

Since the EB approach uses the techniques and results of the Bayes 
approach some of the standard results are reviewed in this and 
following sections. Applications of the EB methods described in this 
monograph are envisaged as occurring in repetitive experimentation 
with parameters varying from experiment to experiment. The notion 
of expected loss therefore seems rather natural in this context, hence 
our introduction to Bayes methods is based on the notion of a loss 
function. 

Let a loss, $L(\delta(x), \lambda) \ge 0$, be incurred when the parameter value is $\lambda$
and a decision $\delta(x)$ is made. For example, if $\delta(x)$ is a point estimate of
$\lambda$ it is common to put $L(\delta(x), \lambda) = (\delta(x) - \lambda)^2$. Or, if $\delta(x)$ is an interval
estimate one may put $L(\delta(x), \lambda) = 0$ or 1 according as the interval
does or does not contain $\lambda$.
The expected loss for fixed $\lambda$ is the risk

$$ R_\delta(\lambda) = \int L\{\delta(x), \lambda\} f(x|\lambda) \, dx, \tag{1.2.1} $$

where $f(x|\lambda)$ is the probability density function (p.d.f.) of $X$.
Modification of (1.2.1) for discrete $X$ is obvious. The selection of a
decision function now becomes a matter of choosing a $\delta(x)$ whose
$R_\delta(\lambda)$ has acceptable properties; see, for example, Ferguson (1967,
section 1.6). Clearly, the smaller $R_\delta(\lambda)$ for any $\lambda$, the better, but it is
trivially true that there is generally no $\delta^*(x)$ such that $R_{\delta^*}(\lambda) \le R_\delta(\lambda)$
for all $\lambda$ and every $\delta$. Thus there is no uniformly best $\delta$, and an
additional criterion for selecting a $\delta$ has to be invoked. One of these
is provided in the Bayes approach, in which the goodness of a $\delta$ is
judged by the overall expected loss, or the average risk, with respect
to the prior distribution $G(\lambda)$. It is given by

$$ W(\delta) = \int\!\!\int L\{\delta(x), \lambda\} f(x|\lambda) \, dx \, dG(\lambda). \tag{1.2.2} $$


Now $\delta$ is chosen so as to minimize $W$. The $\delta$ which does minimize $W$
will depend on $G$, and it is denoted by $\delta_G$ to indicate the dependence.
We shall call $W(\delta_G)$ the Bayes risk, but different terminology is also
in use. Some authors refer to $W(\delta_G)$ as the Bayes envelope functional.

The actual determination of $\delta_G$ can proceed in principle by noting
in an obvious abbreviated notation that $W = E(L) = E\,E(L|x)$ so
that we choose $\delta$ for every $x$ so as to minimize

$$ E(L|x) = \frac{\int L(\delta, \lambda) f(x|\lambda) \, dG(\lambda)}{\int f(x|\lambda) \, dG(\lambda)}. \tag{1.2.3} $$


The details of such calculations depend on L. Most of this book is 
devoted to problems of point estimation and decision between two 
hypotheses. In the former case we usually take $L = (\delta - \lambda)^2$, in the
latter L = 0 or 1 according as the right or wrong decision is made. 
The relevant calculations for these two cases are taken up in 
somewhat more detail in following sections. 
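The pointwise minimization of the posterior expected loss in (1.2.3) can be illustrated by brute force. The sketch below is an illustration under stated assumptions, not a recommended method: it assumes a Poisson likelihood and a hypothetical three-point prior, and searches a grid of candidate decisions for the one with the smallest posterior expected squared error. All names and numerical values are ours. For squared error loss the grid minimizer should agree, up to the grid spacing, with the posterior mean.

```python
import math


def poisson_pmf(x, lam):
    """Poisson probability p(x | lam)."""
    return math.exp(-lam) * lam**x / math.factorial(x)


def squared_error(delta, lam):
    """Loss L = (delta - lam)^2."""
    return (delta - lam) ** 2


def posterior_expected_loss(delta, x, support, weights, loss=squared_error):
    """E(L | x) of (1.2.3) for a discrete prior putting `weights` on `support`."""
    num = sum(loss(delta, lam) * poisson_pmf(x, lam) * w
              for lam, w in zip(support, weights))
    den = sum(poisson_pmf(x, lam) * w for lam, w in zip(support, weights))
    return num / den


def bayes_decision(x, support, weights, candidates, loss=squared_error):
    """Candidate decision with the smallest posterior expected loss."""
    return min(candidates,
               key=lambda d: posterior_expected_loss(d, x, support, weights, loss))


def posterior_mean(x, support, weights):
    """Mean of the posterior distribution of the parameter given x."""
    num = sum(lam * poisson_pmf(x, lam) * w for lam, w in zip(support, weights))
    den = sum(poisson_pmf(x, lam) * w for lam, w in zip(support, weights))
    return num / den


if __name__ == "__main__":
    support, weights = [1.0, 2.0, 4.0], [0.5, 0.3, 0.2]  # hypothetical prior
    grid = [i / 1000 for i in range(5001)]                 # candidate decisions
    for x in range(6):
        print(x, bayes_decision(x, support, weights, grid),
              round(posterior_mean(x, support, weights), 4))
```

Passing a different `loss` function (for example a 0-1 loss over two candidate values) reuses the same search for the two-hypothesis problem.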


1.3 Bayes point estimation: one parameter 

Let $\delta(x)$ be any point estimate of $\lambda$. If $\delta_G$ is the Bayes point estimate we
have

$$ W(\delta) \ge W(\delta_G) \tag{1.3.1} $$

by definition. Now, with $L(\delta, \lambda) = (\delta - \lambda)^2$,

$$ \begin{aligned}
W(\delta) &= \int\!\!\int \{\delta(x) - \lambda\}^2 f(x|\lambda) \, dG(\lambda) \, dx \\
&= W(\delta_G) + \int\!\!\int \{\delta(x) - \delta_G(x)\}^2 f(x|\lambda) \, dG(\lambda) \, dx \\
&\quad + 2 \int\!\!\int \{\delta(x) - \delta_G(x)\}\{\delta_G(x) - \lambda\} f(x|\lambda) \, dG(\lambda) \, dx.
\end{aligned} \tag{1.3.2} $$

Condition (1.3.1) will be satisfied if the third term in (1.3.2) is zero,
which can be arranged by putting

$$ \int \{\delta_G(x) - \lambda\} f(x|\lambda) \, dG(\lambda) = 0 $$

for every $x$. This gives

$$ \delta_G(x) = \frac{\int \lambda f(x|\lambda) \, dG(\lambda)}{\int f(x|\lambda) \, dG(\lambda)}. \tag{1.3.3} $$

Thus $\delta_G(x)$ is the mean of the posterior distribution of $\Lambda$ for
given $X = x$. The same result is readily obtained from (1.2.3) by
differentiation with respect to $\delta$.

The following are some notes arising from the derivation of $\delta_G(x)$:

1. In the denominator of the right-hand side of (1.3.3) we have the
marginal p.d.f. of $X$,

$$ f_G(x) = \int f(x|\lambda) \, dG(\lambda). \tag{1.3.4} $$

The corresponding marginal distribution function is $F_G(x)$ and
sometimes it will be convenient to refer to the marginal random
variable (r.v.) whose cumulative distribution function (c.d.f.) is
$F_G(x)$ as $X_G$.

2. In the joint distribution of $X_G$ and $\Lambda$, $\delta_G(x)$ is the regression of $\Lambda$
on $X_G$.

3. The marginal distribution of $X_G$ is also called a compound or a
mixed distribution.

4. With $\delta_G(x)$ given by (1.3.3) relation (1.3.2) becomes

$$ W(\delta) = W(\delta_G) + \int \{\delta(x) - \delta_G(x)\}^2 f_G(x) \, dx, $$

which is often useful for calculating $W(\delta)$.

Example 1.3.1 Let $F(x|\lambda)$ be the $N(\lambda, \sigma^2)$ c.d.f. and $G(\lambda)$ the
$N(\mu_G, \sigma_G^2)$ c.d.f. Then, according to standard theory for the bivariate
normal distribution, the distribution of $X_G$ is $N(\mu_G, \sigma^2 + \sigma_G^2)$, the
joint distribution of $\Lambda$ and $X_G$ is bivariate normal with correlation
coefficient $\rho$ such that $\rho^2 = \sigma_G^2/(\sigma^2 + \sigma_G^2)$ and

$$ \delta_G(x) = (x/\sigma^2 + \mu_G/\sigma_G^2)/(1/\sigma^2 + 1/\sigma_G^2); \tag{1.3.5} $$

see, for example, Lindley (1965, p. 2). Also,

$$ W(\delta_G) = 1/(1/\sigma^2 + 1/\sigma_G^2). \tag{1.3.6} $$

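Example 1.3.2 Let $p(x|\lambda)$ be the Poisson probability distribution

$$ p(x|\lambda) = e^{-\lambda} \lambda^x / x!, \quad x = 0, 1, 2, \ldots $$

and $G(\lambda)$ a gamma c.d.f. with p.d.f.

$$ g(\lambda) = \{\alpha^\beta / \Gamma(\beta)\} \lambda^{\beta-1} e^{-\alpha\lambda}, \quad \lambda > 0; \ \alpha, \beta > 0. $$

Then

$$ \delta_G(x) = (\beta + x)/(\alpha + 1) \tag{1.3.7} $$

and

$$ W(\delta_G) = \beta/\{\alpha(\alpha + 1)\}. \tag{1.3.8} $$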

In Examples 1.3.1 and 1.3.2 the type of prior distribution is given. 
Such knowledge will rarely be available in applications of EB 
methods, hence the following examples, in which the form of $G(\lambda)$ is
not specified, are of particular interest. They have played an 
important role in the literature on EB methods. 


Example 1.3.3 The Poisson case as in Example 1.3.2:

$$ \delta_G(x) = \frac{(1/x!) \int \lambda^{x+1} e^{-\lambda} \, dG(\lambda)}{(1/x!) \int \lambda^x e^{-\lambda} \, dG(\lambda)} = (x+1) \, p_G(x+1) / p_G(x), $$

where $p_G(x)$ is a mixed Poisson probability distribution.
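The representation of the Bayes estimate through marginal probabilities can be checked in a case where everything is available in closed form. The sketch below assumes a gamma prior with shape $\beta$ and rate $\alpha$ (the parametrization consistent with (1.3.7)); under that assumption the mixed Poisson distribution is negative binomial. The function names are ours.

```python
import math


def mixed_poisson_pmf(x, alpha, beta):
    """Marginal probability of x when the Poisson mean has a gamma prior
    with shape beta and rate alpha (a negative binomial distribution)."""
    log_p = (math.lgamma(beta + x) - math.lgamma(beta) - math.lgamma(x + 1)
             + beta * math.log(alpha / (alpha + 1.0))
             - x * math.log(alpha + 1.0))
    return math.exp(log_p)


def bayes_from_marginal(x, alpha, beta):
    """Bayes estimate written as (x + 1) p_G(x + 1) / p_G(x)."""
    return ((x + 1) * mixed_poisson_pmf(x + 1, alpha, beta)
            / mixed_poisson_pmf(x, alpha, beta))


def bayes_closed_form(x, alpha, beta):
    """Posterior mean under the gamma prior, as in (1.3.7)."""
    return (beta + x) / (alpha + 1.0)


if __name__ == "__main__":
    alpha, beta = 1.5, 2.0  # hypothetical prior parameters
    for x in range(6):
        print(x, bayes_from_marginal(x, alpha, beta),
              bayes_closed_form(x, alpha, beta))
```

The two columns agree to rounding error, which is the point of the example: the Bayes estimate depends on the prior only through the marginal distribution of the observations.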
Example 1.3.4 The geometric distribution:

$$ p(x|\lambda) = (1 - \lambda)\lambda^x, \quad x = 0, 1, 2, \ldots; \ 0 < \lambda < 1. $$

$$ \delta_G(x) = \frac{\int (1-\lambda)\lambda^{x+1} \, dG(\lambda)}{\int (1-\lambda)\lambda^x \, dG(\lambda)} = p_G(x+1)/p_G(x). \tag{1.3.9} $$

Example 1.3.5 The negative binomial distribution:

$$ p(x|\lambda) = (1-\lambda)^p \{\Gamma(p+x)/\Gamma(p)\} \lambda^x / x!, \quad x = 0, 1, 2, \ldots; \ \lambda, p > 0, $$

with $p$ known. By steps like those in the previous examples

$$ \delta_G(x) = \left(\frac{x+1}{p+x}\right) \frac{p_G(x+1)}{p_G(x)}. $$


In the preceding three examples we found that

$$ \delta_G(x) = C(x) \, p_G(x+1)/p_G(x) \tag{1.3.10} $$

where $C(x)$ is a known function of $x$. The fact that $\delta_G(x)$ is expressed
in terms of marginal probabilities of the r.v. $X_G$ turns out to be useful
in EB estimation, as was pointed out by Robbins (1955). In fact, a
result like (1.3.10) holds for the members of the exponential family of
discrete probability distributions which can be put in the form

$$ p(x|\lambda) = \lambda^x \exp\{C(\lambda) + V(x)\} \tag{1.3.11} $$

for which

$$ \delta_G(x) = \exp\{V(x) - V(x+1)\} \, p_G(x+1)/p_G(x). \tag{1.3.12} $$

A similar result is obtainable for a continuous r.v. $Y$ with
distribution in the exponential family of p.d.f.s

$$ f(y|\mu) = \exp\{A(\mu) + B(\mu)W(y) + U(y)\}. \tag{1.3.13} $$

Making the transformations $X = W(Y)$ and $\lambda = \exp\{B(\mu)\}$ the p.d.f.
of $X$ is

$$ f(x|\lambda) = \lambda^x \exp\{C(\lambda) + V(x)\}; \tag{1.3.14} $$

note also (1.3.11). Hence $\delta_G(x)$ can be written as in (1.3.12) with $p$
replaced by $f$. Although this produces an expression for $\delta_G(x)$ in
terms of marginal probabilities the parametrization may be somewhat
unnatural, as the following example shows.

Example 1.3.6 If the distribution of $X$ is $N(\mu, \sigma^2)$, $\sigma^2$ known, the
Bayes estimator of $\lambda = e^{\mu/\sigma^2}$ is

$$ \delta_G(x) = e^{(x + 1/2)/\sigma^2} f_G(x+1)/f_G(x), $$

a result of the type of (1.3.12). However, in applications of the normal
distribution interest will usually be centred on $\mu$ rather than
$\exp(\mu/\sigma^2)$.

In the case of continuous distributions it is more natural to use
differentiation rather than the differencing process leading to (1.3.12)
to obtain a similar result. An illustration is given in Example 1.3.7.

Example 1.3.7 The normal distribution as in Example 1.3.6:

Since $\ln f(x|\mu) = -\ln(\sigma\sqrt{2\pi}) - \frac{1}{2\sigma^2}(x - \mu)^2$,

$$ \frac{1}{f(x|\mu)} \frac{\partial f(x|\mu)}{\partial x} = -\frac{(x - \mu)}{\sigma^2} $$

or

$$ \mu = x + \sigma^2 \frac{1}{f(x|\mu)} \frac{\partial f(x|\mu)}{\partial x}. $$

Appropriate substitutions in (1.3.3) give the Bayes point estimate of $\mu$
as

$$ \delta_G(x) = x + \sigma^2 f'_G(x)/f_G(x); \tag{1.3.15} $$

see Miyasawa (1961).

In Example 1.3.6 the Bayes estimate of $\exp(\mu/\sigma^2)$ is expressed in
terms of marginal probability densities of $X_G$, and in Example 1.3.7
the Bayes estimate of $\mu$ is expressed in terms of the marginal p.d.f.
and its derivative. As in the case of discrete $X$ this type of expression
of Bayes estimates is useful in connection with EB estimation, and
will be taken up in more detail in section 3.3.
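The derivative form of the Bayes estimate in Example 1.3.7 can be checked against the closed form of Example 1.3.1. The sketch below assumes a normal prior with mean `mu_g` and variance `sigma2_g`, so that the marginal density is normal with variance equal to the sum of the two variances; the derivative of the marginal density is taken by a central difference. The function names, the step size `h` and the numerical values are our own choices.

```python
import math


def normal_pdf(x, mean, var):
    """Density of the normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def marginal_pdf(x, sigma2, mu_g, sigma2_g):
    """Marginal density of X when the prior is normal (assumed here)."""
    return normal_pdf(x, mu_g, sigma2 + sigma2_g)


def bayes_via_derivative(x, sigma2, mu_g, sigma2_g, h=1e-5):
    """x + sigma^2 f_G'(x) / f_G(x), with f_G' from a central difference."""
    f = marginal_pdf(x, sigma2, mu_g, sigma2_g)
    df = (marginal_pdf(x + h, sigma2, mu_g, sigma2_g)
          - marginal_pdf(x - h, sigma2, mu_g, sigma2_g)) / (2.0 * h)
    return x + sigma2 * df / f


def bayes_closed_form(x, sigma2, mu_g, sigma2_g):
    """Precision-weighted average of x and the prior mean."""
    return (x / sigma2 + mu_g / sigma2_g) / (1.0 / sigma2 + 1.0 / sigma2_g)


if __name__ == "__main__":
    sigma2, mu_g, sigma2_g = 1.0, 0.5, 2.0  # hypothetical values
    for x in (-2.0, 0.0, 1.0, 3.0):
        print(x, bayes_via_derivative(x, sigma2, mu_g, sigma2_g),
              bayes_closed_form(x, sigma2, mu_g, sigma2_g))
```

In EB use the marginal density and its derivative would be replaced by estimates from past observations; the numerical differentiation here only stands in for that step.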


1.4 Bayes decisions between k simple hypotheses 

We consider first the case of two simple hypotheses $H_1: \lambda = \lambda_1$ and
$H_2: \lambda = \lambda_2$, $\lambda_1 < \lambda_2$. The prior probabilities are $P(\Lambda = \lambda_j) = \theta_j$, $j = 1, 2$,
with $\theta_1 + \theta_2 = 1$. Thus $G(\lambda)$ is a step function with jumps at $\lambda_1$
and $\lambda_2$ of sizes $\theta_1$ and $\theta_2$ respectively. The decision function $\delta(x)$ is
defined in terms of a partition of the sample space into two regions
$A_1$ and $A_2$ such that $H_j$ is accepted when $x \in A_j$, $j = 1, 2$. The most
common definition of the loss function, $L\{\delta(x), \lambda\}$, in this context is
to let $L = 0$ when the correct decision is made, and 1 otherwise. Then

$$ W(\delta) = \theta_1 \int_{A_2} f(x|\lambda_1) \, dx + \theta_2 \int_{A_1} f(x|\lambda_2) \, dx, $$

with the obvious modifications for discrete $X$. We see that $W(\delta)$ is
the overall expected proportion of wrong decisions.

To minimize $W(\delta)$ we define $A_1$ and $A_2$ such that $x \in A_1$ when
$\theta_2 f(x|\lambda_2) < \theta_1 f(x|\lambda_1)$. Hence the Bayes decision rule can be stated
as follows: choose $H_1$ if the posterior probability of $H_1$ exceeds 1/2.
Equivalently, choose whichever of $H_1$ and $H_2$ has the greater
posterior probability.

When $f(x|\lambda)$ is such that $f(x|\lambda_1)/f(x|\lambda_2)$ is monotonic in $x$ it
follows immediately that $A_1$ comprises all values of $x < \xi_G$, where
$x = \xi_G$ is the solution of

$$ \theta_2 f(x|\lambda_2) = \theta_1 f(x|\lambda_1), \tag{1.4.1} $$

a suitable convention being adopted when $X$ is discrete.

Example 1.4.1 Suppose that the distribution of $X$ is $N(\lambda, 1)$. Then
(1.4.1) becomes

$$ \theta_2/\theta_1 = \exp\{-(\lambda_2 - \lambda_1)(2x - \lambda_1 - \lambda_2)/2\}, $$

the right-hand side is monotonic in $x$, and

$$ \xi_G = \tfrac{1}{2}(\lambda_1 + \lambda_2) + \ln(\theta_1/\theta_2)/(\lambda_2 - \lambda_1). \tag{1.4.2} $$
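The rule of Example 1.4.1 can be sketched as follows. Solving (1.4.1) for $x$ in the $N(\lambda, 1)$ case gives the threshold $\tfrac{1}{2}(\lambda_1 + \lambda_2) + \ln(\theta_1/\theta_2)/(\lambda_2 - \lambda_1)$; this expression is our own working from (1.4.1) and should be read as such. The sketch compares the threshold rule with a direct comparison of posterior probabilities; the function names and the numerical values are hypothetical.

```python
import math


def normal_pdf(x, mean):
    """Density of N(mean, 1)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def posterior_prob_h1(x, lam1, lam2, theta1):
    """Posterior probability of H1 given x, with prior probability theta1."""
    w1 = theta1 * normal_pdf(x, lam1)
    w2 = (1.0 - theta1) * normal_pdf(x, lam2)
    return w1 / (w1 + w2)


def threshold(lam1, lam2, theta1):
    """Solution in x of theta2 f(x | lam2) = theta1 f(x | lam1)."""
    theta2 = 1.0 - theta1
    return 0.5 * (lam1 + lam2) + math.log(theta1 / theta2) / (lam2 - lam1)


def choose(x, lam1, lam2, theta1):
    """Bayes decision: 1 for H1, 2 for H2, by comparing posterior probabilities."""
    return 1 if posterior_prob_h1(x, lam1, lam2, theta1) > 0.5 else 2


if __name__ == "__main__":
    lam1, lam2, theta1 = 0.0, 2.0, 0.7  # hypothetical values
    xi = threshold(lam1, lam2, theta1)
    print("threshold:", xi)
    for x in (xi - 0.5, xi - 0.01, xi + 0.01, xi + 0.5):
        print(round(x, 3), choose(x, lam1, lam2, theta1),
              round(posterior_prob_h1(x, lam1, lam2, theta1), 4))
```

A larger prior probability for the first hypothesis moves the threshold to the right, enlarging the region in which that hypothesis is accepted.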

In principle the case of $2 < k < \infty$ simple hypotheses is not
different from that of two simple hypotheses. The hypotheses are $H_j$:
$\lambda = \lambda_j$, $j = 1, 2, \ldots, k$, $\lambda_1 < \lambda_2 < \cdots < \lambda_k$, and the respective prior
probabilities are $\theta_j$, $j = 1, 2, \ldots, k$, with $\theta_1 + \theta_2 + \cdots + \theta_k = 1$. Defining $\delta$
such that $A_j$ is the region of acceptance of $H_j$, $j = 1, 2, \ldots, k$,

$$ W(\delta) = \sum_{j=1}^{k} \theta_j \sum_{i \ne j} \int_{A_i} f(x|\lambda_j) \, dx. \tag{1.4.3} $$

Following the argument used for deriving the Bayes rule in the case
$k = 2$, $W(\delta)$ is minimized by choosing for every $x$ the $H_j$ which has the
largest posterior probability.


1.5 Bayes decisions between two composite hypotheses 

Let $B_1$ and $B_2$ represent a partition of the parameter space, so that
we choose between $H_1: \lambda \in B_1$ and $H_2: \lambda \in B_2$. The sample space is
partitioned into $A_1$ and $A_2$ so that $H_j$ is selected when $x \in A_j$, $j = 1, 2$.
Then, with the 0-1 loss structure as in section 1.4,

$$ W(\delta) = \int_{x \in A_2} \int_{\lambda \in B_1} f(x|\lambda) \, dG(\lambda) \, dx + \int_{x \in A_1} \int_{\lambda \in B_2} f(x|\lambda) \, dG(\lambda) \, dx. \tag{1.5.1} $$

Arguing as in section 1.4, $W(\delta)$ is minimized by assigning a point $x$ in
the sample space to $A_1$ if

$$ \int_{\lambda \in B_2} f(x|\lambda) \, dG(\lambda) < \int_{\lambda \in B_1} f(x|\lambda) \, dG(\lambda). \tag{1.5.2} $$

If equality can occur in (1.5.2) with non-zero probability, as with
discrete $X$, a suitable convention is adopted.

A special case of some interest is when $\lambda$ is a location parameter
and we let $H_1: \lambda < \lambda_0$, $H_2: \lambda \ge \lambda_0$. Then (1.5.2) becomes

$$ \int_{\lambda_0}^{\infty} f(x|\lambda) \, dG(\lambda) < \int_{-\infty}^{\lambda_0} f(x|\lambda) \, dG(\lambda). $$

An interpretation of this result is: choose $H_1$ if the posterior median
of $\Lambda$ is smaller than $\lambda_0$, otherwise choose $H_2$.

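An alternative loss structure for the special case $H_1: \lambda < \lambda_0$, $H_2$:
$\lambda \ge \lambda_0$ has been considered by Samuel (1963) and Robbins (1964). Let

$$ \text{Loss} = \begin{cases}
0 & \text{for } \lambda < \lambda_0, \ x \in A_1 \\
0 & \text{for } \lambda \ge \lambda_0, \ x \in A_2 \\
\lambda_0 - \lambda & \text{for } \lambda < \lambda_0, \ x \in A_2 \\
\lambda - \lambda_0 & \text{for } \lambda \ge \lambda_0, \ x \in A_1.
\end{cases} $$

Then

$$ W(\delta) = \int_{A_1} \int_{\lambda_0}^{\infty} (\lambda - \lambda_0) f(x|\lambda) \, dG(\lambda) \, dx + \int_{A_2} \int_{-\infty}^{\lambda_0} (\lambda_0 - \lambda) f(x|\lambda) \, dG(\lambda) \, dx, $$

and it is minimized by assigning $x$ to $A_1$ if

$$ \frac{\int \lambda f(x|\lambda) \, dG(\lambda)}{\int f(x|\lambda) \, dG(\lambda)} < \lambda_0, $$

i.e. if the posterior mean is $< \lambda_0$.

The arguments leading to (1.5.2) are readily extended to $k > 2$
hypotheses; no further detail will be given here.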
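1.6 Bayes estimation of vector parameters

Much of what has gone before can be generalized quite easily when
$X$ is replaced by the vector r.v. $\mathbf{X} = (X_1, X_2, \ldots, X_p)^T$ and $\lambda$ by
$\boldsymbol{\lambda} = (\lambda_1, \lambda_2, \ldots, \lambda_k)^T$. Standard examples are the multivariate normal
and the multinomial distributions. Letting $L\{\boldsymbol{\delta}(\mathbf{x}), \boldsymbol{\lambda}\}$ be the loss in
making decision $\boldsymbol{\delta}(\mathbf{x})$ about $\boldsymbol{\lambda}$, the discussion of section 1.2 carries
over with hardly any change, and formally the Bayes decision rule
$\boldsymbol{\delta}_G$ can be obtained by using (1.2.3) on replacing $x$ by $\mathbf{x}$, etc. Of
course, $G$ is now a $k$-variate distribution.

For some problems of point estimation a natural generalization of
squared error loss is

$$ L(\boldsymbol{\delta}, \boldsymbol{\lambda}) = (\boldsymbol{\delta} - \boldsymbol{\lambda})^T A (\boldsymbol{\delta} - \boldsymbol{\lambda}) $$

where $A$ is a positive definite matrix. The Bayes point estimate $\boldsymbol{\delta}_G(\mathbf{x})$
can be obtained from the vector version of (1.3.2),

$$ \begin{aligned}
W(\boldsymbol{\delta}) &= W(\boldsymbol{\delta}_G) + \int\!\!\int \{\boldsymbol{\delta}(\mathbf{x}) - \boldsymbol{\delta}_G(\mathbf{x})\}^T A \{\boldsymbol{\delta}(\mathbf{x}) - \boldsymbol{\delta}_G(\mathbf{x})\} f(\mathbf{x}|\boldsymbol{\lambda}) \, dG(\boldsymbol{\lambda}) \, d\mathbf{x} \\
&\quad + 2 \int\!\!\int \{\boldsymbol{\delta}(\mathbf{x}) - \boldsymbol{\delta}_G(\mathbf{x})\}^T A \{\boldsymbol{\delta}_G(\mathbf{x}) - \boldsymbol{\lambda}\} f(\mathbf{x}|\boldsymbol{\lambda}) \, dG(\boldsymbol{\lambda}) \, d\mathbf{x}.
\end{aligned} $$

The third term in the right-hand side of the above expression can be
made equal to zero, thus ensuring $W(\boldsymbol{\delta}) \ge W(\boldsymbol{\delta}_G)$, by letting the $j$th
element, $\delta_{Gj}(\mathbf{x})$, of $\boldsymbol{\delta}_G(\mathbf{x})$ be

$$ \delta_{Gj}(\mathbf{x}) = \frac{\int \lambda_j f(\mathbf{x}|\boldsymbol{\lambda}) \, dG(\boldsymbol{\lambda})}{\int f(\mathbf{x}|\boldsymbol{\lambda}) \, dG(\boldsymbol{\lambda})}, $$

the posterior mean of $\Lambda_j$. Remarkably this Bayes point estimate is
not dependent on $A$. Of course, the value of $W(\boldsymbol{\delta})$ for any $\boldsymbol{\delta}$,
including $\boldsymbol{\delta}_G$, will depend on $A$.

If a single function $w(\boldsymbol{\lambda})$ of the parameters is to be estimated
subject to quadratic loss its Bayes point estimate is readily seen to be
the posterior mean of $w(\boldsymbol{\Lambda})$.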


Example 1.6.1 Suppose that the joint distribution of $\mathbf{X}|\boldsymbol{\Lambda}$ is $k$-variate
normal with known covariance matrix $\Sigma$ and that the prior
distribution of $\boldsymbol{\Lambda}$ is $k$-variate normal with mean $\boldsymbol{\mu}_G$ and covariance
matrix $\Sigma_G$. Then the posterior distribution of $\boldsymbol{\Lambda}$ is $k$-variate normal
with mean vector

$$ \{\Sigma^{-1} + \Sigma_G^{-1}\}^{-1} \{\Sigma^{-1}\mathbf{x} + \Sigma_G^{-1}\boldsymbol{\mu}_G\} $$

and covariance matrix $\{\Sigma^{-1} + \Sigma_G^{-1}\}^{-1}$.

The marginal distribution of $\mathbf{X}_G$ is $k$-variate normal with mean
vector $\boldsymbol{\mu}_G$ and covariance matrix $\Sigma + \Sigma_G$.

1.7 Bayes decision and multiple independent observations 

Throughout, the discussion so far has been in terms of a single
observation $x$ or $\mathbf{x}$ being made on $X$ or $\mathbf{X}$. Generalization to the
case of $m$ independent observations, on $X$ or $\mathbf{X}$, is straightforward in
principle. Concentrating for now on the univariate single parameter
case, suppose that $m$ independent observations are made on $X$.
Then $f(x|\lambda)$ in (1.2.3) is replaced by the likelihood $\prod_{i=1}^{m} f(x_i|\lambda)$,
the method otherwise remaining unchanged. If a one-dimensional
sufficient statistic $t(x_1, \ldots, x_m)$ exists we have $\prod_{i=1}^{m} f(x_i|\lambda) =
g(x_1, \ldots, x_m) h(t|\lambda)$. Therefore on substitution in (1.2.3) the problem is
essentially reduced to the one-sample case. Most of the important
distributions taken as special cases in this monograph admit one-
or low-dimensional sufficient statistics.

Example 1.7.1 Let $X$ be a Poisson r.v. with mean $\lambda$. Then

$$ \prod_{i=1}^{m} p(x_i|\lambda) = e^{-m\lambda} \prod_{i=1}^{m} (\lambda^{x_i}/x_i!) $$

and (1.2.3) assumes the form given by a single observation $y = \sum_{i=1}^{m} x_i$
on a Poisson r.v. with mean $m\lambda$.
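The reduction in Example 1.7.1 can be verified numerically. The sketch below assumes a hypothetical discrete prior on three values of the Poisson mean and computes the posterior probabilities in two ways: from the full likelihood of the $m$ observations, and from the single total treated as a Poisson observation with mean $m$ times the parameter. The names and values are ours.

```python
import math


def poisson_pmf(x, mean):
    """Poisson probability of x for the given mean."""
    return math.exp(-mean) * mean**x / math.factorial(x)


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def posterior_from_sample(xs, support, weights):
    """Posterior over `support` using the product of the m Poisson terms."""
    joint = [w * math.prod(poisson_pmf(x, lam) for x in xs)
             for lam, w in zip(support, weights)]
    return normalize(joint)


def posterior_from_sum(xs, support, weights):
    """Posterior using only y = sum(xs), a Poisson count with mean m * lam."""
    m, y = len(xs), sum(xs)
    joint = [w * poisson_pmf(y, m * lam) for lam, w in zip(support, weights)]
    return normalize(joint)


if __name__ == "__main__":
    xs = [2, 0, 3, 1]                                     # hypothetical data
    support, weights = [0.5, 1.5, 3.0], [0.2, 0.5, 0.3]  # hypothetical prior
    print(posterior_from_sample(xs, support, weights))
    print(posterior_from_sum(xs, support, weights))
```

The factors that differ between the two likelihoods do not involve the parameter, so they cancel on normalization and the two posteriors coincide.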

Similar simplifications can be made for univariate and multivariate
normal distributions which will play an important role in the
sequel.

Where low-dimensional sufficient statistics exist the form of the
factor $h(t|\lambda)$ in the factorization of the likelihood is exploited to
generate the class of natural conjugate prior distributions; see for
example de Groot (1970, p. 159).

If low-dimensional sufficient statistics do not exist it is possible to
reduce data by calculating maximum likelihood (ML) estimates, and
then, using exact or approximate sampling distributions of these
estimates, to apply the Bayes techniques in a straightforward manner.
Strictly, this does not produce a Bayes rule, but the asymptotic
sufficiency of MLEs, for which the reader may refer to Cox and
Hinkley (1974, p. 307), provides some justification for such a
procedure.


1.8 Empirical Bayes methods 

Empirical Bayes methods rely on the existence of a prior distribution
$G(\lambda)$ which can be given a frequency interpretation, and which can be
estimated using suitable observations. Thus the EB approach can be
essentially non-Bayesian in the sense of not involving subjective
probabilities. In the simplest case the EB sampling scheme is as
follows: a current observation $x$ is made when the parameter value is
$\lambda$, a realization of $\Lambda$, and $x$ is to be used in a decision about $\lambda$. At the
time of making the current observation there are available past
observations $x_1, x_2, \ldots, x_n$ obtained with independent past realizations
$\lambda_1, \lambda_2, \ldots, \lambda_n$ of $\Lambda$. In this scheme every $x_i$ is a realization of $X_i$,
and the $X_i$'s are mutually independent. It is useful to represent the
EB sampling scheme as in (1.8.1).

                    EB sampling scheme                    (1.8.1)

                  Previous stages                            Current stage
unknowns:         $\lambda_1, \lambda_2, \ldots, \lambda_n$      $\lambda$
observables:      $x_1, x_2, \ldots, x_n$                        $x$

The words ‘current’ and ‘past’ are not necessarily to be taken in a
strictly temporal sense. Usually it is assumed that the actual values
$\lambda_1, \lambda_2, \ldots, \lambda_n$ never become known.

The possibility of obtaining an estimate of $G$ arises through the
fact that $x_1, x_2, \ldots, x_n$ may be regarded as an independent sequence
of observations on $X_G$ whose distribution function $F_G$ is given in
(1.3.4) with $f$ replaced by $F$. The empirical c.d.f. of these $x$-values,
$F_n(x)$, is an estimate of $F_G(x)$ such that $F_n(x) \to F_G(x)$ in probability
(P), as $n \to \infty$ for every $x$. This suggests that it might be possible to
find a c.d.f. $\hat{G}(\lambda)$ such that

$$ F_n(x) \approx \int F(x|\lambda) \, d\hat{G}(\lambda) $$

with the property that $\hat{G}(\lambda) \to G(\lambda)$, (P), for all $\lambda$ as $n \to \infty$.
If such an empirical $\hat{G}(\lambda)$ can be found, substituting it for $G(\lambda)$ in
the derivation of a Bayes decision rule will yield an empirical Bayes
rule, $\delta_n(x_1, x_2, \ldots, x_n; x)$. This notation emphasizes that the EB rule
will generally depend on all past $x$-values as well as the current $x$. We
may regard it in a broad sense as an estimate of the Bayes rule. In the
case of point estimation, $\delta_n(x_1, x_2, \ldots, x_n; x)$ is a point estimate of
$\delta_G(x)$. EB rules need not necessarily be obtained by directly exploiting
the relation $F_G(x) = \int F(x|\lambda) \, dG(\lambda)$ to obtain an estimate of $G$. In
section 1.9 we discuss an example which illustrates this and also
motivates several questions that may be asked about EB methods.

To conclude this introduction to EB methods we draw attention 
to a somewhat broader interpretation of the term ‘empirical Bayes’ 
than is implied by the typical EB sampling scheme described above. 
Suppose that a Bayes decision rule involves a parameter $\omega$ of a prior
distribution. If $\omega$ is replaced by any estimate derived from observed
data we may refer to the resulting rule as an empirical Bayes rule. An
example of this sort occurs in the EB analysis of contingency tables
proposed by Laird (1978). In this example the parameter $\sigma^2$ of a prior
distribution is estimated, but the sampling scheme is not exactly like 
the classical EB scheme. 


1.9 An example: EB estimation in the Poisson case 

In Example 1.3.3, (1.3.9) gives an expression for the Bayes point
estimate of $\lambda$ in terms of the marginal probabilities $p_G(x + 1)$ and
$p_G(x)$. Now suppose that among the past observations there are $f_n(x)$
having the value $x$, $x = 0, 1, 2, \ldots$. Since $x_1, x_2, \ldots, x_n$ are independent
realizations of $X_G$ with probability distribution (p.d.) $p_G(x)$ we can
estimate $p_G(x)$ by $f_n(x)/n$. Including the current $x$ we have $[1 + f_n(x)]$
observations with the value $x$, out of a total of $n + 1$ observations,
and $f_n(x + 1)$ with value $x + 1$. Therefore we have an estimate of the
Bayes estimate given by

$$ \delta_n(x_1, \ldots, x_n; x) = (x + 1) f_n(x + 1)/[1 + f_n(x)]. \tag{1.9.1} $$

Several comments on this estimate are in order: 

1. Explicit estimation of $G$ is not needed to obtain $\delta_n$. Only estimates
of the marginal $X$ distribution are used. Other such EB estimates
will be studied in more detail in section 3.4; they are called simple
EB estimates.

2. It is not a smooth function of $x$. In a finite sample $f_n(x)$ will clearly
be 0 for $x$ large enough, and irregularities in the observed values
$f_n(0), f_n(1), \ldots$ will typically produce a graph of $\delta_n$ as shown in
Fig. 1.1.

3. Figure 1.1 also shows graphs of the maximum likelihood estimate
$T = x$, and the Bayes estimate $\delta_G(x)$ for a typical example. It
seems clear that a smoothed version of $\delta_n$ will be closer to $\delta_G$, and
therefore better in some reasonable sense. Of course the Bayes
estimate will not necessarily be a straight line function of $x$ as
shown in Fig. 1.1, but in the Poisson case, and many others, it is
always monotonic in $x$. This follows by noting that

$$ \begin{aligned}
\delta_G(x+1) - \delta_G(x) &= \frac{\{\int \lambda^{x+2} e^{-\lambda} \, dG(\lambda)\}\{\int \lambda^{x} e^{-\lambda} \, dG(\lambda)\} - \{\int \lambda^{x+1} e^{-\lambda} \, dG(\lambda)\}^2}{\{\int \lambda^{x+1} e^{-\lambda} \, dG(\lambda)\}\{\int \lambda^{x} e^{-\lambda} \, dG(\lambda)\}} \\
&= M(x)[E(\Lambda_x^2) - \{E(\Lambda_x)\}^2]
\end{aligned} $$

where $\Lambda_x$ is a r.v. with p.d.f. $\propto \lambda^x e^{-\lambda} \, dG(\lambda)$ and $M(x) > 0$.

Ways of obtaining smooth EB estimators will be given in
Chapter 3.





Fig. 1.2 


4. As $n \to \infty$, $f_n(x)/n \to p_G(x)$ in probability, for every $x$. Consequently
$\delta_n(x_1, \ldots, x_n; x) \to \delta_G(x)$, (P), for every $x$. Thus $\delta_n$ may be said to be
asymptotically optimal in this sense. There are other ways of
viewing the question of asymptotic optimality. For example, the
‘goodness’ of a particular $\delta_n$ could be measured by $W(\delta_n)$, but with
respect to past observations $W(\delta_n)$ is a random variable. Hence
the goodness of the method of EB estimation represented by $\delta_n$
could be measured by $E_n W(\delta_n)$, where $E_n$ indicates expectation
with respect to past samples of size $n$. We could say that $\delta_n$ is
asymptotically optimal if $E_n W(\delta_n) \to W(\delta_G)$ as $n \to \infty$. Asymptotic
optimality will be discussed further in section 1.10 and elsewhere.
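The simple EB estimate (1.9.1) is easily computed from a frequency count of the past observations. The following sketch simulates past data under an assumed gamma prior (shape and rate chosen arbitrarily for illustration) so that the EB estimate can be compared with the Bayes estimate it is meant to approximate. The function names are ours, and the Poisson variates are drawn by Knuth's multiplication method only to keep the example free of external libraries.

```python
import math
import random
from collections import Counter


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate by Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate_past(n, shape, rate, seed=0):
    """Past x-values: parameters from an assumed gamma prior, then Poisson."""
    rng = random.Random(seed)
    return [poisson_draw(rng, rng.gammavariate(shape, 1.0 / rate))
            for _ in range(n)]


def simple_eb_estimate(x, counts):
    """Estimate (1.9.1): (x + 1) f_n(x + 1) / [1 + f_n(x)]."""
    return (x + 1) * counts.get(x + 1, 0) / (1 + counts.get(x, 0))


if __name__ == "__main__":
    shape, rate = 2.0, 1.0                      # hypothetical prior parameters
    counts = Counter(simulate_past(2000, shape, rate))
    for x in range(8):
        bayes = (shape + x) / (rate + 1.0)      # Bayes estimate under this prior
        print(x, counts.get(x, 0), round(simple_eb_estimate(x, counts), 3), bayes)
```

Printing the estimates for successive values of the current observation shows the irregular, non-monotonic behaviour noted in comment 2, most visibly where the observed frequencies are small.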

1.10 The goodness of EB procedures 

The Bayes decision rule, $\delta_G(x)$, is defined as that $\delta(x)$ which
minimizes $W(\delta)$ so that $W(\delta_G) \le W(\delta)$ for all $\delta$. Now, for any $\delta$ the
value of $W(\delta)$ is a measure of its goodness and in the Bayes sense $\delta_G$
is best, or optimal. If $\delta_n$ is an EB rule derived from a particular set of
past observations, $W(\delta_n)$ is a measure of its goodness. With respect
to the past observations $W(\delta_n)$ is, of course, a random variable.
Therefore an assessment of the overall goodness of an EB method
should pay attention to the distribution of $W(\delta_n)$.

A natural measure of performance of an EB method, in the light of
the preceding discussion, is the expectation of $W(\delta_n)$ with respect to
previous samples of size $n$, i.e. $E_n W(\delta_n)$. We could then say that $\delta_n$ is
asymptotically optimal (a.o.) if $E_n W(\delta_n) \to W(\delta_G)$ as $n \to \infty$ (Robbins,
1964). Even if $\delta_n$ is a.o., $E_n W(\delta_n)$ may be considerably greater than
$W(\delta_G)$ for finite $n$ values, and it may be greater than $W(T)$ where $T$ is
a non-Bayes rule. For example, in point estimation $T$ may be a
maximum likelihood estimator, and $\delta_n$ would not necessarily be
preferred to it unless $E_n W(\delta_n) < W(T)$. Typically the relation between
$E_n W(\delta_n)$, $W(T)$, $W(\delta_G)$ can be depicted as in Fig. 1.2. Usually
there will be a value of $n$, say $n_T$, such that $W(T) < E_n W(\delta_n)$ for
$n < n_T$. The asymptotic optimality of EB procedures has been studied
in considerable detail by several authors, and more will be said about
it in section 3.2 and elsewhere.

A point of notation: it will often be convenient to refer to the W 
values of the Bayes, EB and other rules as W (Bayes), W (EB), etc. 

In choosing whether to use an EB method in preference to a
non-Bayes method, criteria other than $E_n W(\delta_n)$ could be used. For
example, if one were concerned that the realized $\delta_n$ should be better
than $T$ one may focus attention on $P_n\{W(\delta_n) < W(T)\}$, where $P_n$
indicates a probability calculated with reference to previous samples
of size $n$. Another definition of asymptotic optimality is a.o.(P):
$W(\delta_n) \to W(\delta_G)$, (P), as $n \to \infty$. The property
$E_n W(\delta_n) \to W(\delta_G)$ as $n \to \infty$ can be called a.o.(E). With some
restrictions a.o.(P) will imply a.o.(E). Also a.o.(P) will imply
$P_n\{W(\delta_n) < W(T)\} > 1 - \epsilon$ for $n$ large enough. In practice the
choice between $\delta_n$ and $T$ has to be made on the basis of known
results for $E_n W(\delta_n)$ or $P_n\{W(\delta_n) < W(T)\}$ for cases similar to
the problem in hand, or by trying to estimate these quantities.
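
As a rough illustration of how $E_n W(\delta_n)$ relates to $W(T)$ and $W(\delta_G)$, the sketch below works out the three quantities in closed form for one deliberately simple model. Everything about the model is an assumption made for this illustration, not something taken from the text: $X \mid \lambda$ is $N(\lambda, \sigma^2)$, the prior is $N(\mu_G, \sigma_G^2)$ with both variances known, $T = x$, and the EB rule $\delta_n$ is the Bayes rule with $\mu_G$ replaced by the mean of $n$ past observations. The function name `risks` is hypothetical.

```python
def risks(n, sigma2=1.0, sigma2_g=1.0):
    """Return (W(T), E_n W(delta_n), W(delta_G)) for the assumed normal model."""
    c = sigma2 / (sigma2 + sigma2_g)   # weight the Bayes rule puts on the prior mean
    w_bayes = (1.0 - c) * sigma2       # W(delta_G): the posterior variance
    w_t = sigma2                       # W(T) for the non-Bayes rule T = x
    # delta_n differs from delta_G by c * (mean of past x's - mu_G); the mean of
    # n past observations has variance (sigma2 + sigma2_g) / n, so the expected
    # excess risk is c**2 * (sigma2 + sigma2_g) / n = c * sigma2 / n.
    e_w_eb = w_bayes + c * sigma2 / n
    return w_t, e_w_eb, w_bayes


for n in (1, 5, 25, 125):
    print(n, risks(n))
```

In this particular model $E_n W(\delta_n)$ decreases to $W(\delta_G)$ and never exceeds $W(T)$, so the crossover value $n_T$ sketched in Fig. 1.2 does not appear; it is only a minimal case in which the a.o.(E) property can be seen directly.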

1.11 Smooth EB estimates 

The monotonicity of the Bayes estimator in the Poisson case has
been noted in remark (3), section 1.9. Many other Bayes estimators
have this property, indicating that EB estimators should also have a
minimal smoothness. Smooth EB estimates can be obtained in at
least two ways: (a) by smoothing a simple EB estimator obtained as
indicated in the example of section 1.9; (b) by exploiting the
monotonicity of the Bayes estimator and replacing $G$ by an estimated $G$.
Regarding (a), the smoothing can be by fitting a straight line or some
other suitable curve through the observed non-smooth graph of a
simple $\delta_n$. In some instances fitting a particular functional form, like
a straight line, is tantamount to assuming a particular parametric
form for $G$. Direct smoothing of $\delta_n$ can also be done by some
monotonic regression technique. Smoothing according to (b) requires
estimation of $G$. Here we have many possibilities, the simplest
being the assumption of a parametric form of distribution for $G$, in
which case standard techniques of estimation, such as maximum
likelihood, can be used.

If the form of G is not known a smooth estimator can still be 
obtained by taking G to belong to a certain class of distributions. For 
example, G may be taken to be a finite step function. The EB
estimator obtained in this way will not generally be a.o. unless the 
assumed class contains the actual G. However, it may be that a 
member of the class can be a good approximation to G, in which case 
the derived smooth estimator, although it may be described as an 
approximate EB estimator, may still have satisfactory performance. 

The use of an approximation to the true G is discussed further in 
section 1.12 and later. 


1.12 Approximate Bayes and empirical Bayes methods


1.12.1 Linear Bayes estimators 

Most of the discussion of this section will be in terms of point 
estimation although some of the ideas do carry over to other types of
decisions. Any point estimator derived using the prior distribution to 
find a best estimator within a certain class can be called an 
approximate Bayes estimator if it is not the actual Bayes estimator. 
We consider first linear Bayes estimators (Hartigan (1969), Griffin 
and Krutchkoff (1971)). 

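The simplest case is when there is just one observation $x$ on $X$
when the parameter value is $\lambda$. We consider estimates of the form

$$\delta(\omega_0, \omega_1; x) = \delta(\omega; x) = \omega_0 + \omega_1 x$$

where $\omega_0$ and $\omega_1$ are chosen to minimize $W\{\delta(\omega; x)\}$. The
terminology ‘linear Bayes’ is explained by the form of $\delta(\omega; x)$ and the
fact that $G(\lambda)$ plays a role in the determination of $\omega_0$ and $\omega_1$. Now

$$W\{\delta(\omega; x)\} = \iint (\omega_0 + \omega_1 x - \lambda)^2 f(x \mid \lambda)\,dx\,dG(\lambda) \tag{1.12.1}$$

and it is easily minimized by differentiation w.r.t. $\omega_0$ and $\omega_1$, the
Bayes values, $\omega_G$, being obtained as the solutions of the following
equations in $\omega_0$, $\omega_1$:

$$\begin{bmatrix} 1 & \int E(X \mid \lambda)\,dG(\lambda) \\ \int E(X \mid \lambda)\,dG(\lambda) & \int E(X^2 \mid \lambda)\,dG(\lambda) \end{bmatrix} \begin{bmatrix} \omega_0 \\ \omega_1 \end{bmatrix} = \begin{bmatrix} \int \lambda\,dG(\lambda) \\ \int \lambda E(X \mid \lambda)\,dG(\lambda) \end{bmatrix} \tag{1.12.2}$$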


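Example 1.12.1 Suppose that the distribution of $X$ is Poisson with
mean $\lambda$. Then $E(X \mid \lambda) = \lambda$ and $E(X^2 \mid \lambda) = \lambda^2 + \lambda$. Substituting in
(1.12.2) gives

$$\omega_{0G} = \{E(\Lambda)\}^2 / \{\mathrm{var}(\Lambda) + E(\Lambda)\}, \qquad \omega_{1G} = \mathrm{var}(\Lambda) / \{\mathrm{var}(\Lambda) + E(\Lambda)\}. \tag{1.12.3}$$

If the prior distribution is a $\Gamma$ distribution as in Example 1.3.2,
$E(\Lambda) = \beta/\alpha$ and $\mathrm{var}(\Lambda) = \beta/\alpha^2$, and substitution in (1.12.3)
gives the result

$$\delta(\omega_G; x) = \omega_{0G} + \omega_{1G} x = (\beta + x)/(\alpha + 1) = \delta_G(x)$$

as in (1.3.7). In general, of course, $\delta(\omega_G; x) \ne \delta_G(x)$.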

One of the main reasons for introducing linear Bayes estimators
here is that they are easily adapted to empirical Bayes estimation.
Recall that for the EB estimation we require in general an estimate of
$G(\lambda)$. In formula (1.12.2),

$$\int E(X^r \mid \lambda)\,dG(\lambda) = E(X_G^r), \qquad r = 1, 2,$$

which are obviously estimated quite readily using the observed
marginal distribution of past observations. If $E(X \mid \Lambda) = \Lambda$ as in
Example 1.12.1 only the first two moments of the distribution of $\Lambda$
have to be estimated.
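
The adaptation to EB estimation in the Poisson case of Example 1.12.1 can be sketched as follows. The code is a minimal illustration under stated assumptions, not a method prescribed by the text: it uses the moment relations $E(X_G) = E(\Lambda)$ and $\mathrm{var}(X_G) = \mathrm{var}(\Lambda) + E(\Lambda)$, plugs sample moments of past observations into (1.12.3), and truncates a negative variance estimate at zero (the truncation and the function names are choices made here).

```python
def linear_eb_poisson(past_x):
    """Estimate (omega_0, omega_1) of (1.12.3) from past Poisson counts."""
    n = len(past_x)
    m_hat = sum(past_x) / n                            # estimates E(Lambda)
    var_x = sum((x - m_hat) ** 2 for x in past_x) / n  # sample variance of X
    v_hat = max(var_x - m_hat, 0.0)                    # estimates var(Lambda), truncated at 0
    denom = v_hat + m_hat
    if denom == 0.0:                                   # all counts zero: nothing to shrink
        return m_hat, 0.0
    return m_hat ** 2 / denom, v_hat / denom


def linear_eb_estimate(x, past_x):
    """Linear EB estimate omega_0 + omega_1 * x for a current observation x."""
    w0, w1 = linear_eb_poisson(past_x)
    return w0 + w1 * x


print(linear_eb_poisson([0, 1, 2, 3, 4, 8]))
```

When the past counts show no more spread than a single Poisson distribution would, the estimated slope is zero and every current observation is estimated by the overall mean.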

When m > 1 observations on X are made, linear Bayes estimation 
can be extended to letting the estimator be a linear function of order 
statistics (Lwin, 1976). For more details see section 3.8. 


1.12.2 Approximations to the prior distribution 

Suppose that $G^*$ is an approximation to $G$. Then $\delta_{G^*}$ is an
approximation to $\delta_G$. For the purpose of empirical Bayes inference
the sense in which $G^*$ might be an approximation to $G$ is that $G^*$ is
that member of a certain class of distributions for which the distance
$D\{F_G, F_{G^*}\}$ is minimized. The distance measure $D$ is yet to be chosen
and may depend on specific applications. This definition of
approximation is motivated by the fact that $F_G$ is observable in the EB
context thus making the determination of $G^*$ feasible, at least in the
sense that it can be estimated statistically.

The classes of distributions that may be considered for G* include 

1. the natural conjugate priors; 

2. finite step functions. 

The choice of distance measure $D$, determination of $G^*$ and the
goodness of $\delta_{G^*}$ as an approximation to $\delta_G$ are topics to be taken
further in Chapters 2 and 3 and elsewhere, but a simple example
follows.


Example 1.12.2 Suppose that $X = N(\lambda, 1)$ (where $X = F$: the
distribution of $X$ is $F$) with $\Lambda = N(0, 1.0)$ and that $G^*(\lambda)$ is a step
function with jumps at $\lambda_1$, $\lambda_2 = 0$, $\lambda_3$ each of size 1/3 and that
$\lambda_1 = -\lambda_3$. One way of determining $\lambda_1$ is by equating the variances
of $G^*$ and $G$. This gives $\lambda_3 = -\lambda_1 = 1.2$, $W(\delta_{G^*}) = 0.5$, to compare
with $W(\delta_G) = 0.5$ and $W(T = x) = 1.0$.
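
The risk of the approximate rule in Example 1.12.2 can be checked numerically. The sketch below is an illustration only: it assumes squared error loss, uses the standard decomposition $W(\delta) = W(\delta_G) + E\{\delta(X) - \delta_G(X)\}^2$ with $\delta_G(x) = x/2$ and marginal $X \sim N(0, 2)$, and evaluates the integral by a midpoint rule. The integration limits, step count and function names are arbitrary choices made here.

```python
import math

LAM3 = math.sqrt(1.5)           # (2/3) * LAM3**2 = 1 matches var(G) = 1
SUPPORT = (-LAM3, 0.0, LAM3)    # jumps of G*, each of size 1/3


def delta_gstar(x):
    """Posterior mean of lambda under G* when X | lambda is N(lambda, 1)."""
    exps = [-(x - lam) ** 2 / 2.0 for lam in SUPPORT]
    top = max(exps)                              # subtract the max for stability
    w = [math.exp(e - top) for e in exps]        # equal prior weights cancel
    return sum(lam * wi for lam, wi in zip(SUPPORT, w)) / sum(w)


def risk_gstar(half_width=12.0, steps=24000):
    """W(delta_G*) = 0.5 + E{(delta_G*(X) - X/2)^2} with X ~ N(0, 2)."""
    h = 2.0 * half_width / steps
    excess = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * h
        dens = math.exp(-x * x / 4.0) / math.sqrt(4.0 * math.pi)  # N(0, 2) density
        excess += (delta_gstar(x) - x / 2.0) ** 2 * dens * h
    return 0.5 + excess


print(LAM3, risk_gstar())
```

The computed risk lies a little above the Bayes risk 0.5 and well below $W(T) = 1.0$, in line with the rounded figures quoted in the example.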

1.12.3 Using non-sufficient statistics 

Suppose that $m$ independent observations are made on the
one-dimensional r.v. $X$ whose distribution depends on the single
parameter $\lambda$. If a one-dimensional sufficient statistic $t(x_1, \ldots, x_m)$
exists the Bayes decision rule reduces to a function $\delta_G(t)$ of $t$. When a
sufficient $t$ does not exist calculations involving likelihoods
$\prod_{i=1}^{m} f(x_i \mid \lambda)$ can become complicated, especially in the EB
framework, and it may be contemplated to effect a reduction by basing
the decision on an estimate $\hat{\lambda}$ of $\lambda$.

To obtain a decision rule based on $\hat{\lambda}$ the p.d.f. $f(x \mid \lambda)$ is replaced by
the p.d.f. $h(\hat{\lambda} \mid \lambda)$ of $\hat{\lambda}$ in the formulae for obtaining Bayes estimates.
Even if the exact distribution of $\hat{\lambda}$ is used the resulting decision rule is
not the actual Bayes rule; sometimes it may be possible only to
obtain an approximation for $h(\hat{\lambda} \mid \lambda)$. The advantage of this approach,
especially in EB decisions, is that estimation of an approximation to
$G$ can be considerably simplified.

1.13 Concomitant variables

In many practical cases there will be concomitant information about
the parameter values. Specifically, recall the EB sampling scheme
where we have observations $(x_1, x_2, \ldots, x_n)$ when the parameter
values are $(\lambda_1, \lambda_2, \ldots, \lambda_n)$. Every $x_i$ is usually thought of as an
estimate of the corresponding $\lambda_i$. Now it may happen that we also
have associated with every $x_i$ an observation $c_i$ on a concomitant
variable $C$. Every $c_i$ is not necessarily an estimate of $\lambda_i$, but $C$ and $\Lambda$
may not be independent, so that taking account of the observed $c$
should improve the estimate of $\lambda$. However, the emphasis is still on
estimating individual $\lambda$ values, and not on exploring the relationship
between $\Lambda$ and $C$, for instance, through the regression of
$\Lambda$ on $C$. More details of EB analysis in this case are given in
section 4.7.

1.14 Competitors of EB methods 

A brief discussion of developments in EB methods has been given in 
section 1.8. The general idea of these developments is that the EB 
technique is an attractive compromise between the conventional 
non-Bayes approach and the fully specified Bayesian approach for 
the analysis of historical data arising in a sampling scheme which can 
be represented as in (1.8.1). 

There are other ways of utilizing all the available information 
given by an EB sampling in an ‘optimal’ way. They produce 
‘competitors’ to EB methods. Among such competitors are: (1) the 
compound decision (CD) approach initiated by Robbins (1951) in 
the hypothesis testing framework and by Stein (1955) for estimation; 
(2) the full Bayesian (FB) multiparameter approach initiated by 
Lindley (1962, 1971); and (3) a modified likelihood approach first 
employed by Henderson et al. (1959). All of these approaches treat 
the problem of the EB scheme as one of simultaneous decisions 
about all unknown $\lambda$'s in all stages. Brief introductions to these
approaches now follow. 

1.14.1 Compound decision theory 

Consider the problem of estimating the unknown $\lambda$ of a $N(\lambda, \sigma^2)$
distribution discussed in Example 1.3.1. Suppose that we consider
the problem of estimating all the elements of
$\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_{n+1})^T$ simultaneously using the data given by
$\mathbf{x} = (x_1, \ldots, x_{n+1})^T$ where $\lambda_{n+1}$ and $x_{n+1}$ respectively stand
for $\lambda$ and $x$ of the current stage in EB scheme (1.8.1). For notational
simplicity we let $k = n + 1$ and consider the $k$-parameter problem of
estimating the mean vector $\boldsymbol{\lambda}$ using the data $\mathbf{x}$ from a $k$-variate
normal distribution with covariance matrix $\sigma^2 I_{k \times k}$ where
$I_{k \times k}$ is a $k \times k$ identity matrix.

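Let $\hat{\boldsymbol{\lambda}}$ be an estimate to be sought such that the total quadratic
loss

$$L(\hat{\boldsymbol{\lambda}}, \boldsymbol{\lambda}) = \sum_{i=1}^{k} (\hat{\lambda}_i - \lambda_i)^2 \tag{1.14.1}$$

is optimized in some sense. For example, one might minimize the
expected value of $L$. Suppose also that the estimates of $\boldsymbol{\lambda}$ are
restricted to the class defined by

$$\hat{\lambda}_i = a_0 + a_1 x_i, \qquad i = 1, \ldots, k.$$

Then the optimal estimate $\boldsymbol{\lambda}^*$ in this class, which minimizes the risk

$$W(\hat{\boldsymbol{\lambda}}, \boldsymbol{\lambda}) = \int \cdots \int \Big\{ \sum_{i=1}^{k} (a_0 + a_1 x_i - \lambda_i)^2 \Big\} \prod_{i=1}^{k} f(x_i \mid \lambda_i)\,dx_1 \cdots dx_k, \tag{1.14.2}$$

is given by

$$a_0 = (1 - a_1)\bar{\lambda}, \qquad a_1 = \sigma_\lambda^2 / (\sigma^2 + \sigma_\lambda^2)$$

where

$$\bar{\lambda} = \frac{1}{k} \sum_{i=1}^{k} \lambda_i, \qquad \sigma_\lambda^2 = \frac{1}{k} \sum_{i=1}^{k} (\lambda_i - \bar{\lambda})^2.$$

Hence the optimal linear estimate of $\boldsymbol{\lambda}$ is $\boldsymbol{\lambda}^*$, with elements given by

$$\lambda_i^* = \bar{\lambda} + (1 - c_\lambda)(x_i - \bar{\lambda}), \qquad i = 1, 2, \ldots, k, \tag{1.14.3}$$

where

$$c_\lambda = 1 - a_1 = \sigma^2 / (\sigma^2 + \sigma_\lambda^2).$$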

The optimal linear estimator $\boldsymbol{\lambda}^*$ depends on the unknown quantities
$\bar{\lambda}$ and $\sigma_\lambda^2$. However, they can be estimated from the data by noting
that

$$E(\bar{x}) = \bar{\lambda}, \qquad E(S_{xx}) = (k - 1)\sigma^2 + k\sigma_\lambda^2 \tag{1.14.4}$$

where

$$S_{xx} = \sum_{i=1}^{k} (x_i - \bar{x})^2.$$

The relationships (1.14.4) provide unbiased estimates of $\bar{\lambda}$ and $\sigma_\lambda^2$ as

$$\hat{\bar{\lambda}} = \bar{x}, \qquad \hat{\sigma}_\lambda^2 = s_{xx}/k - (k - 1)\sigma^2/k. \tag{1.14.5}$$

Alternatively, one can estimate $c_\lambda$ directly using the class of estimates

$$\hat{c}_\lambda = \nu \sigma^2 s_{xx}^{-1}. \tag{1.14.6}$$

Under the assumption of normality of the $\lambda_i$'s, $\nu$ can be chosen to
obtain an unbiased estimate of $c_\lambda$, i.e. $\nu = (k - 2)$. Use of either
(1.14.5) or (1.14.6) provides an estimator of $c_\lambda$ in (1.14.2) and hence an
estimate of $\lambda_i$ based on the optimal linear estimate can be constructed
as

$$\lambda_i^+ = \bar{x} + (1 - \hat{c}_\lambda)(x_i - \bar{x}). \tag{1.14.7}$$

The type of estimate (1.14.7) is very similar to the EB estimate
derived from (1.12.3). In the literature of compound decision theory,
such estimates are called compound estimates as contrasted with
‘simple estimates’ which use only the data of the ith stage to estimate
unknowns at the ith stage. A more detailed discussion of compound
decision theory is given in Chapter 7. The main aim of the above
development is to demonstrate that it is not necessary to assume the
existence of a prior distribution of $\lambda$'s to obtain estimates of the EB
type. Such estimates can be constructed in a purely non-Bayes
setting so long as the loss structure reflects the fact that decisions at
all stages are considered simultaneously. Using the total loss as in
(1.14.1) is one way of doing this. Estimates of the form (1.14.7) were
first given by James and Stein (1960).
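
A compound estimate of the form (1.14.7) can be sketched as below, with $\hat{c}_\lambda$ obtained by substituting the moment estimate (1.14.5) into $c_\lambda = \sigma^2/(\sigma^2 + \sigma_\lambda^2)$. This is an illustrative sketch only: $\sigma^2$ is assumed known, the truncation of a negative variance estimate at zero is a choice made here, and the simulated $\lambda_i$ values and the function name are hypothetical.

```python
import random


def compound_estimates(x, sigma2):
    """Estimates x_bar + (1 - c_hat)(x_i - x_bar) of the form (1.14.7)."""
    k = len(x)
    x_bar = sum(x) / k
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    # Moment estimate (1.14.5) of sigma_lambda^2, truncated at zero (assumption).
    var_lam = max(s_xx / k - (k - 1) * sigma2 / k, 0.0)
    c_hat = sigma2 / (sigma2 + var_lam)
    return [x_bar + (1.0 - c_hat) * (xi - x_bar) for xi in x]


# Simulated check of the total quadratic loss (1.14.1) with made-up parameters.
rng = random.Random(0)
lam = [rng.gauss(0.0, 1.0) for _ in range(500)]
x = [value + rng.gauss(0.0, 1.0) for value in lam]
estimates = compound_estimates(x, sigma2=1.0)
loss_simple = sum((xi - li) ** 2 for xi, li in zip(x, lam))
loss_compound = sum((ei - li) ** 2 for ei, li in zip(estimates, lam))
print(loss_simple, loss_compound)
```

No prior distribution is used anywhere in the function; the simulation draws the $\lambda_i$ from a normal distribution only to have something to estimate.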

1.14.2 The full Bayesian approach to the EB scheme

Originally, the introduction of James-Stein type estimators aroused
suspicion and confusion, but it stimulated further work aimed at
clarifying the ‘intriguing aspects’ of estimators of the form (1.14.7).
Lindley (1962), in the discussion of Stein’s (1962) paper, formulated a
full Bayesian (FB) approach to the problem of simultaneous
estimation of $\boldsymbol{\lambda}$ in the EB scheme. This approach employs the
assumptions and data of the EB approach according to (1.14.1). In addition,
the FB approach assumes that the unknown $\lambda$'s are exchangeable so
that their joint prior distribution is a mixture of a distribution $G$ by a
hyper prior distribution. Hence the joint density function of $\boldsymbol{\lambda}$ is
given by

$$g(\boldsymbol{\lambda} \mid P) = \int \prod_{i=1}^{k} g(\lambda_i \mid \phi)\,dP(\phi)$$

where $P$ is the hyper prior distribution function of the parameter $\phi$ of
the distribution function $G$. Thus the joint distribution of $(\mathbf{x}, \boldsymbol{\lambda})$ has a
density function given by

$$a(\mathbf{x}, \boldsymbol{\lambda}) = \int \Big\{ \prod_{i=1}^{k} f(x_i \mid \lambda_i)\, g(\lambda_i \mid \phi) \Big\}\,dP(\phi).$$

The FB approach then proceeds to obtain the posterior distribution
of $\boldsymbol{\lambda}$ given $\mathbf{x}$. Its density function is given by

$$h(\boldsymbol{\lambda} \mid \mathbf{x}) = a(\mathbf{x}, \boldsymbol{\lambda}) \Big/ \int a(\mathbf{x}, \boldsymbol{\lambda})\,d\boldsymbol{\lambda}. \tag{1.14.8}$$

Inference on $\boldsymbol{\lambda}$ then proceeds by looking at various aspects of
(1.14.8). In particular the posterior mean of $\boldsymbol{\lambda}$ is of interest. But
the posterior mode was advocated for two reasons. Firstly, this
provides a treatment parallel to that of maximizing the likelihood
function in the non-Bayes approach. Secondly, for more realistic
cases, the mode seems to be more tractable than the mean.

We demonstrate below an application of the FB approach to the
problem considered in Example 1.9.2. Here $f(x \mid \lambda)$ is $N(\lambda, \sigma^2)$ and it
is assumed that $g(\lambda \mid \phi)$ is the $N(\mu_G, \sigma_G^2)$ density so that the
hyper-parameter vector is $\phi = (\mu_G, \sigma_G^2)$. By assuming exchangeability of the
$\lambda_i$’s, the prior distribution of $\boldsymbol{\lambda}$ is obtained for a specific choice of the
hyper-parameter density function $p(\mu_G, \sigma_G^2)$. For simplicity, we
assume further that $\sigma_G^2$ and $\sigma^2$ are known and $\mu_G$ is taken to have a
uniform prior distribution on the real line. Thus the prior
distribution of $\boldsymbol{\lambda}$ becomes

$$g(\boldsymbol{\lambda} \mid P) = \int \Big\{ \prod_{i=1}^{k} g(\lambda_i \mid \mu_G, \sigma_G^2) \Big\}\,d\mu_G.$$

The posterior distribution of $\boldsymbol{\lambda}$ can be shown to be (Lindley, 1971) a
multivariate normal distribution with mean vector $\boldsymbol{\lambda}^*$ and
covariance matrix $V^*$. The ith element of $\boldsymbol{\lambda}^*$ is given by

$$\lambda_i^* = \bar{x} + (1 - c(\sigma_G^2))(x_i - \bar{x}) \tag{1.14.9}$$

where

$$c(\sigma_G^2) = \sigma^2 / (\sigma^2 + \sigma_G^2).$$

The elements of $V^*$ depend only on $\sigma_G^2$, $\sigma^2$ and $k$. Thus the FB
approach in this simple case gives an estimate $\lambda_i^*$ for $\lambda_i$ which is
similar to the EB estimate.

In Chapter 7, the FB approach will be discussed in more detail. It 
will be seen that in general the EB and FB approaches do not lead to 
identical results although they both lead to non-simple estimates or 
decision rules. The FB approach has been unified, and extended to a 
number of situations, in Lindley and Smith (1972), Smith (1973) 
and Deely and Lindley (1981) thus providing formidable rival 
procedures for the EB methods. However, the FB approach seems to 
be heavily dependent on the parametric assumptions in general and 
normality in particular. It seems too early to decide which of the EB 
and FB approaches is more practically attractive. 

1.14.3 A modified likelihood approach 

The EB sampling scheme was introduced and the EB type methods 
were proposed with a general assertion that the standard non-Bayes 
methods cannot deal with information in the form of previous data; 
in particular the usual likelihood approach was regarded as lacking 
such a mechanism. However, there has been a development which 
may be termed a modification of the usual likelihood for handling 
data from an EB scheme. This appeared in connection with the 
random effects model in the analysis of variance (Henderson et al., 
1959; Nelder, 1972; Finney, 1974). Briefly, the approach defines the 
likelihood function of the unknown $\boldsymbol{\lambda}$ as the joint probability of
$(\mathbf{x}, \boldsymbol{\lambda})$ in the EB scheme, i.e. the likelihood function is

$$L(\mathbf{x}, \boldsymbol{\lambda}) = \prod_{i} f(x_i \mid \lambda_i)\, g(\lambda_i \mid \phi).$$

This likelihood function is then maximized with respect to $\boldsymbol{\lambda}$ and $\phi$.

In the special case when $f(x_i \mid \lambda_i)$ is the $N(\lambda_i, \sigma^2)$ density with known
$\sigma^2$ and $g(\lambda_i \mid \phi)$ is the $N(\mu_G, \sigma_G^2)$ density with known $\sigma_G^2$, the
‘maximum likelihood’ estimate is given by

$$\hat{\mu}_G(L) = \bar{x}$$

$$\hat{\lambda}_i(L) = \hat{\mu}_G + (1 - c(\sigma_G^2))(x_i - \hat{\mu}_G)$$

where $c(\sigma_G^2)$ is given by (1.14.9).
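
The closed form above can be checked by maximizing the modified likelihood numerically. The sketch below is an illustration under the same special-case assumptions (normal kernel and normal prior, both variances known); the coordinate-wise maximization scheme, the starting value and the function name are choices made here and are not part of the original development.

```python
def modified_likelihood_fit(x, sigma2, sigma2_g, iterations=200):
    """Maximize prod_i f(x_i | lam_i) g(lam_i | mu) over (mu, lam_1, ..., lam_k)."""
    c = sigma2 / (sigma2 + sigma2_g)
    mu = 0.0                       # arbitrary starting value
    lam = list(x)
    for _ in range(iterations):
        # For fixed mu, each lam_i maximizes its own factor of the likelihood.
        lam = [(1.0 - c) * xi + c * mu for xi in x]
        # For fixed lam, the prior factor is maximized by mu = mean(lam).
        mu = sum(lam) / len(lam)
    return mu, lam


mu_hat, lam_hat = modified_likelihood_fit([1.0, 2.0, 6.0], sigma2=1.0, sigma2_g=1.0)
print(mu_hat, lam_hat)
```

Each sweep moves $\mu$ a fraction $1 - c$ of the way towards $\bar{x}$, so the iteration settles on $\hat{\mu}_G = \bar{x}$ and the shrunken $\hat{\lambda}_i$; convergence slows as $c$ approaches 1.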

The above approach poses new questions. Is it justified to use
$L(\mathbf{x}, \boldsymbol{\lambda})$ in constructing estimates of $\boldsymbol{\lambda}$ and/or $\phi$? What sort of
properties does this ‘likelihood function’ have? Does it possess a
local maximum? Note that it is not a likelihood function in the usual
sense of the word since $\Lambda$ is an unobservable random variable. More
research seems to be necessary to answer these questions
satisfactorily. A possible answer to the first question is suggested in a
more detailed discussion in Chapter 7.



CHAPTER 2 


Estimation of the prior 
distribution 


2.1 Introduction 

In this chapter we shall deal with one of the two basic technical 
tools required in implementing the EB approach. As mentioned 
in Chapter 1, EB decision rules can be constructed via two main 
approaches. The first is based on an explicit estimation of the 
unknown prior distribution. The second is based on a method of 
expressing the Bayes estimate or decision rule in terms of functionals 
of G and estimating the Bayes rule itself directly. We shall also see 
that, in general, smooth EB rules obtained by the former approach 
can be ‘better’ than those of the latter. The feasibility of estimating a 
prior distribution $G$ depends on the possibility of finding a
distribution function (d.f.) $G$ satisfying the relationship

$$H(x) = \int F(x \mid \lambda)\,dG(\lambda) \tag{2.1.1}$$

where $H(x)$ and $F(x \mid \lambda)$ are given d.f.s. The d.f. $H(x)$ is often called a
‘mixture’ of $F(x \mid \lambda)$ type d.f.s, while $F(x \mid \lambda)$ and $G(\lambda)$ are referred to as
the kernel and mixing distributions, respectively. The general
mathematical problem of finding $G$, given $H$ and $F$ connected by (2.1.1), is
of interest in its own right, and has received attention from many
authors. For example, see Medgyessy (1961), Teicher (1961) for early
studies and Tallis and Chesson (1982) for a more recent study. When
$F(x \mid \lambda)$ is a continuous d.f., (2.1.1) is a Fredholm integral equation of
the first kind. The study of existence and determination of a solution
of (2.1.1) leads to the concept of identifiability which is characterized
by the existence of a unique solution of (2.1.1).

In the empirical Bayes context $H(x)$ becomes $F_G(x)$, but an
additional complication arises since $F_G(x)$ is not known exactly. The
form of $F(x \mid \lambda)$ will usually be assumed known. The assumption of
identifiability is at the heart of the estimation of the mixing
distribution $G$. Its practical importance becomes obvious when it is
seen that estimation procedures for $G$ are not likely to be well defined
without identifiability. Direct information on $F_G(x)$ is supplied only
by $n$ observations on the r.v. $X_G$. They can be used to construct an
empirical d.f. $F_n(x)$ which is an estimate of $F_G(x)$. Thus the problem
here is twofold. First, it is necessary to examine the question of
determining $G$, exactly or approximately, with $H(x)$ known exactly;
this is the study of identifiability of various specified forms of $F(x \mid \lambda)$.
Second, it is necessary to examine the estimation of $G$, again exactly
or approximately, with $H(x)$ replaced by $F_n(x)$.

In the discrete case when the kernel is $p(x \mid \lambda)$ the corresponding
analogue of (2.1.1) is obtained by replacing $F(x \mid \lambda)$ by $p(x \mid \lambda)$ and $H(x)$
by $p_G(x)$, i.e.

$$p_G(x) = \int p(x \mid \lambda)\,dG(\lambda). \tag{2.1.2}$$


2.2 Identifiability 

An immediate question, assuming that a solution of (2.1.1) exists, is
whether it is unique. The concept of identifiability has been defined
as follows: $G$ is said to be identifiable in the mixture $H$, if a unique
solution, $G$, of (2.1.1) can be found. Lack of identifiability is not
uncommon, as is shown by considering the binomial distribution,

$$p(x \mid \lambda) = \binom{n}{x} \lambda^x (1 - \lambda)^{n - x}, \qquad x = 0, 1, \ldots, n. \tag{2.2.1}$$

We see that, for every $x$, $p_G(x)$ is a linear function of the first $n$
moments, $\mu'_r = \int \lambda^r\,dG(\lambda)$, $r = 1, \ldots, n$, of $G(\lambda)$. Consequently any
other $G^*(\lambda)$ with the same first $n$ moments will yield the same mixed
distribution $p_G(x)$.
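
The non-identifiability just described is easy to exhibit numerically. In the sketch below the two mixing distributions are made-up examples chosen for illustration: a three-point distribution and a two-point distribution on $(0, 1)$ that share their first three moments but differ in the fourth. Their binomial mixtures coincide for $n = 2$ and differ for $n = 4$.

```python
from math import comb, sqrt


def binomial_mixture(n, points, weights):
    """p_G(x), x = 0, ..., n, for a discrete mixing distribution G."""
    return [
        sum(w * comb(n, x) * lam ** x * (1 - lam) ** (n - x)
            for lam, w in zip(points, weights))
        for x in range(n + 1)
    ]


d = sqrt(0.06)                                     # makes the second moments agree
G1 = ([0.2, 0.5, 0.8], [1 / 3, 1 / 3, 1 / 3])
G2 = ([0.5 - d, 0.5 + d], [0.5, 0.5])


def max_gap(n):
    """Largest difference between the two mixed distributions for given n."""
    a = binomial_mixture(n, *G1)
    b = binomial_mixture(n, *G2)
    return max(abs(u - v) for u, v in zip(a, b))


print(max_gap(2), max_gap(4))
```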

In general, identifiability depends on both the kernel distribution 
and the family of distributions to which G is assumed to belong. 
Restrictions on this family can render G identifiable, and as we shall 
see, for certain F(x|A) or p(x|A), G is always identifiable. We shall 
accordingly examine identifiability of G under two broad headings: 
identifying restrictions on G, and special families of kernel 
distributions. 

In more extensive studies of identifiability by Teicher (1963), Tallis
(1969) and others, the single parameter $\lambda$ has been replaced by a
vector parameter. Except to report isolated results on multiparameter
kernel distributions, we shall confine attention to the
one-parameter case. Our main concern is with distributions which have
been studied in the EB field.


2.3 Parametric G families 

Perhaps the simplest, and also the most severe restriction on the 
family of G distributions, is that G belongs to a certain parametric 
family of distributions. By such a family we mean distributions 
$G(\lambda; \alpha)$ of known form, depending on a finite-dimensional vector
parameter $\alpha$. Common examples are the normal and gamma families.
In the EB context such a restriction is perhaps somewhat unrealistic, 
but it deserves consideration because the experimenter may have 
good reason for faith in a certain type of prior distribution. For 
example, the ‘naturally conjugate’ prior distribution may be considered
appropriate (cf. Raiffa and Schlaifer, 1961). Technically the
parametric G families enable one to check identifiability more 
readily as is seen below (cf. Tallis, 1969). 

Let $k(x, \lambda; \gamma)$ be the joint density function of the pair $(X, \Lambda)$; $\gamma$ is a
finite-dimensional parameter. Then we can write

$$k(x, \lambda; \gamma) = f(x \mid \lambda; \theta)\, g(\lambda; \alpha) = h(x; \delta)\, b(\lambda \mid x; \beta)$$

where $\alpha$, $\theta$, $\delta$, $\beta$ are finite-dimensional parameter vectors which index
the respective density functions, $h(x; \delta)$ is the density function of the
mixture $H(x)$ and $b(\lambda \mid x; \beta)$ is the density function of the posterior
distribution of $\Lambda$. If $\beta$ can be expressed as a function of $\delta$, $\theta$, say,

$$\beta = \psi(\delta, \theta) \tag{2.3.1}$$

then knowledge of $\delta$ and $\theta$ gives complete determination of $g$. Since
we have

$$b(\lambda \mid x; \beta) = g(\lambda; \alpha)\, f(x \mid \lambda; \theta) / h(x; \delta),$$

this means that $\alpha$ must necessarily be some unique function of $\theta$ and
$\delta$ if the condition (2.3.1) holds, i.e.

$$\alpha = \phi(\theta, \delta). \tag{2.3.2}$$

But the form of the density $g(\lambda; \alpha)$ is known. Thus if the condition
(2.3.1) holds, knowledge of $f(x \mid \lambda)$ and $h(x; \delta)$ uniquely determines
the prior density $g(\lambda; \alpha)$. The condition (2.3.1) readily extends to the
case when $X$ and $\Lambda$ themselves are vector random variables. Of
course (2.3.1) need not be checked if a unique function $\phi$ can be
found in (2.3.2).

We give some simple examples below using conjugate prior 
distributions. They indicate that condition (2.3.1) can be readily 
checked so that the question of identifiability can usually be settled
easily. 

Example 2.3.1 Let $p(x \mid \lambda)$ be the geometric kernel

$$p(x \mid \lambda) = (1 - \lambda)\lambda^x, \qquad x = 0, 1, 2, \ldots$$

The conjugate prior density is of beta type:

$$g(\lambda \mid p, q)\,d\lambda = \lambda^{p-1}(1 - \lambda)^{q-1}\,d\lambda / B(p, q), \qquad p, q > 0.$$

Then

$$p_G(x) = B(p + x, q + 1) / B(p, q).$$

Further the posterior density is also of beta type:

$$b(\lambda \mid x)\,d\lambda = \lambda^{(p+x)-1}(1 - \lambda)^{(q+1)-1}\,d\lambda / B(p + x, q + 1).$$

Thus the parameters of the posterior density are $(p + x, q + 1)$, which
can be expressed as a simple function of the prior distribution
parameters $p$, $q$. Since there are no other parameters in $p(x \mid \lambda)$ except
$\lambda$, specifying $p$ and $q$ in $p_G(x)$ leads to a complete specification of the
parameters of $dB(\lambda \mid x)$. Thus $G$ is identifiable if it does belong to the
specified class.
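
The expression for $p_G(x)$ in Example 2.3.1 can be confirmed by direct numerical integration of (2.1.2). The sketch below is only a check under stated assumptions: the midpoint rule, the step count and the particular values $p = 2$, $q = 3$ are arbitrary choices made here, and the function names are hypothetical.

```python
import math


def beta_fn(a, b):
    """Beta function B(a, b) computed through log-gamma."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def marginal_formula(x, p, q):
    """p_G(x) = B(p + x, q + 1) / B(p, q) for the geometric kernel."""
    return beta_fn(p + x, q + 1) / beta_fn(p, q)


def marginal_numeric(x, p, q, steps=20000):
    """Midpoint-rule integral of p(x | lam) g(lam | p, q) over (0, 1)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        prior = lam ** (p - 1) * (1 - lam) ** (q - 1) / beta_fn(p, q)
        total += (1 - lam) * lam ** x * prior * h
    return total


for x in range(4):
    print(x, marginal_formula(x, 2, 3), marginal_numeric(x, 2, 3))
```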

Example 2.3.2 Let $p(x \mid \lambda)$ be the binomial kernel given by equation
(2.2.1) and $g(\lambda; \alpha)$ the beta density of the preceding example. Then,
the posterior density of $\Lambda$ given $x$ is also of beta type with parameters
$p + x$, $q + (n - x)$. In this example using the notation of (2.3.1),
$\beta = \{p, q, n\}$, $\delta = \{p, q\}$, $\theta = \{n\}$. Thus knowledge of $\theta$ and $\delta$ completely
determines $\beta$. Thus $G$ is identifiable when it belongs to the beta family.

Although identifiability has been established in this case, it is
worth noting that in section 2.4 we treat the case where the
parametric $G$ family is of a particular discrete type, and in that case $G$
is not always identifiable.

Example 2.3.3 Let $f(x \mid \lambda)$ be the $N(\lambda, \sigma^2)$ density and $g(\lambda; \alpha)$ the
$N(\mu, \tau^2)$ density. Then $X_G$ has a $N(\mu, \sigma^2 + \tau^2)$ density. Further, the
posterior distribution of $\Lambda$ given $x$ is $N(\mu^*, \sigma^{*2})$ where

$$\mu^* = (x/\sigma^2 + \mu/\tau^2) / (1/\sigma^2 + 1/\tau^2)$$

$$\sigma^{*2} = (1/\sigma^2 + 1/\tau^2)^{-1}.$$

In the notation of (2.3.1), we have $\delta = \{\mu, \sigma^2 + \tau^2\}$, $\theta = \{\sigma^2\}$,
$\beta = \{\sigma^2, \tau^2, \mu\}$. Thus $G$ is readily identified.


2.4 Finite mixtures 

Let $F(x \mid \lambda_j)$, $j = 1, \ldots, k$ be a family of d.f.s. Then the mixed d.f.

$$F_G(x) = \sum_{j=1}^{k} \theta_j F(x \mid \lambda_j), \tag{2.4.1}$$

where $\theta_i > 0$ for all $i$, $\theta_1 + \cdots + \theta_k = 1$, is an example of a finite
mixture of d.f.s. This finite mixture can also be regarded as a type of
parametrization of $G$. For example, when $\lambda_1, \ldots, \lambda_k$ are known, the
form of $G$ is known and it depends on the finite number of
parameters $\theta_1, \ldots, \theta_k$. But it is more versatile than the strict
parametric approach mentioned in section 2.3 where a functional form of
$G$ needs to be assumed. More generally, we may have a mixture of a
family of d.f.s, $F_j(x)$, $j = 1, 2, \ldots, k$, which are not necessarily of the
same form. Since these do not normally play any part in EB
problems, we shall consider only families in (2.4.1). We shall also
introduce further simplifications as follows: given that $F_G(x)$ is a
finite mixture of the type (2.4.1), both $\theta_j$ and $\lambda_j$, $j = 1, \ldots, k$, may be
unspecified. However, we shall concentrate on the two simpler cases
where either the $\theta_j$’s or the $\lambda_j$’s are given.

Finite mixtures arise in problems of deciding between a finite 
number of alternative hypotheses. They are also important as 
probability models to describe some heterogeneous populations 
which can be regarded as being composed of a finite number of more 
homogeneous subpopulations (see e.g. Titterington, Smith and 
Makov, 1986). In the context of EB estimation, they are important
because the mixing distribution $G$, being a step-function $G_k$, with
jumps of size $\theta_j$ at the points $\lambda_j$, can be used as an approximation to
any $G(\lambda)$. This is a standard mathematical procedure, whose
exploitation in EB work has been motivated by the work of Teicher (1963),
and others, on the identifiability of finite mixtures.

2.4.1 Finite mixtures with $\lambda_1, \ldots, \lambda_k$ given

In this case the mixture (2.4.1) is a special case of the general mixture

$$H(x) = \sum_{j=1}^{k} \theta_j F_j(x) \tag{2.4.2}$$

where $F_j(x)$ is the jth component distribution function; $F_1, \ldots, F_k$ are
also assumed to be distinct members of a family $\mathcal{F}$ of known
distribution functions. In this context, a general sufficient condition
for the identifiability of $H$ is that the $F_j$’s are linearly independent
(Yakowitz and Spragins, 1968; Tallis, 1969). A sufficient condition
for the linear independence of $F_1, \ldots, F_k$ is that (Teicher, 1963) at
least $k$ distinct values $x_1, \ldots, x_k$ exist such that the determinant

$$\begin{vmatrix} F_1(x_1) & F_2(x_1) & \cdots & F_k(x_1) \\ F_1(x_2) & F_2(x_2) & \cdots & F_k(x_2) \\ \vdots & \vdots & & \vdots \\ F_1(x_k) & F_2(x_k) & \cdots & F_k(x_k) \end{vmatrix} \ne 0. \tag{2.4.3}$$

The case with known $\lambda_j$’s is the special case with $F_j(x_i) = F(x_i \mid \lambda_j)$ in
(2.4.3); $F(x \mid \lambda)$ can also be replaced by the p.d.f. $f(x \mid \lambda)$ or the discrete
p.d. $p(x \mid \lambda)$.

An equivalent condition, useful in certain cases considered below,
is that for a certain value of an auxiliary variable $t$ in an interval
$(-\delta, \delta)$, with finite $\delta > 0$, the relation,

$$\sum_{j=1}^{k} \alpha_j \psi_j(t) = 0, \tag{2.4.4}$$

implies that $\alpha_j = 0$ for all $j$. Here $\psi_j(t)$ is the characteristic function of
$F_j$.

Example 2.4.1 The binomial kernel distribution. We have noted
the non-identifiability in general of $G$ in the case of binomial
mixtures. However, when $G$ is the finite step-function under
discussion, condition (2.4.3) is satisfied when $k < n$. The determinant in
(2.4.3) can be reduced to the form

$$\Big\{ \prod_{s=0}^{k-1} \binom{n}{s} \Big\} \Big\{ \prod_{j=1}^{k} (1 - \lambda_j)^{n-k+1} \Big\} \begin{vmatrix} (1 - \lambda_1)^{k-1} & \cdots & (1 - \lambda_k)^{k-1} \\ \lambda_1 (1 - \lambda_1)^{k-2} & \cdots & \lambda_k (1 - \lambda_k)^{k-2} \\ \vdots & & \vdots \\ \lambda_1^{k-1} & \cdots & \lambda_k^{k-1} \end{vmatrix} \tag{2.4.5}$$

when we put $x_1 = 0$, $x_2 = 1$, etc. If the largest $\lambda_j$ is $< 1$, the
determinant is $\ne 0$, because the determinant, $D$, in (2.4.5) is $\ne 0$ if all
$\lambda_j$ are distinct. This follows because, fixing $\lambda_2, \lambda_3, \ldots, \lambda_k$, means that
putting $D = 0$ yields a polynomial equation of degree $k - 1$ in $\lambda_1$. It
has at most $k - 1$ solutions which we know to be $\lambda_2, \lambda_3, \ldots, \lambda_k$. Hence
$D = 0$ only if $\lambda_1$ is equal to one of the other $\lambda$’s, which possibility is
excluded by hypothesis. Further, when say $\lambda_k = 1$, we need consider
only $D$ with the last row and column deleted, and a similar argument
applies; we then also use

$$\sum_{j=1}^{k} \theta_j = 1.$$
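
When the determinant in (2.4.3) is non-zero, the mixing weights can be recovered by solving a linear system. The sketch below illustrates this for the binomial kernel with known $\lambda_j$; the particular values of $n$, $\lambda_j$ and $\theta_j$ are made up for illustration, exact rational arithmetic is used so that the recovery is exact, and the small elimination routine and function names are written here rather than taken from any source.

```python
from fractions import Fraction
from math import comb


def binom_pmf(x, n, lam):
    """Binomial kernel (2.2.1)."""
    return comb(n, x) * lam ** x * (1 - lam) ** (n - x)


def solve(matrix, rhs):
    """Gauss-Jordan elimination; raises ValueError if the determinant is zero."""
    k = len(rhs)
    m = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(k):
        pivot = next((r for r in range(col, k) if m[r][col] != 0), None)
        if pivot is None:
            raise ValueError("zero determinant: weights are not identifiable")
        m[col], m[pivot] = m[pivot], m[col]
        head = m[col][col]
        m[col] = [v / head for v in m[col]]
        for r in range(k):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[-1] for row in m]


def recover_weights(mixture_probs, n, lams):
    """Solve for theta using x_1 = 0, x_2 = 1, ..., x_k = k - 1 as in (2.4.3)."""
    k = len(lams)
    matrix = [[binom_pmf(x, n, lam) for lam in lams] for x in range(k)]
    return solve(matrix, mixture_probs[:k])


n = 4
lams = [Fraction(1, 5), Fraction(1, 2), Fraction(9, 10)]
theta = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
mixture = [sum(t * binom_pmf(x, n, lam) for t, lam in zip(theta, lams))
           for x in range(n + 1)]
print(recover_weights(mixture, n, lams))
```

If two of the supplied $\lambda_j$ coincide, two columns of the matrix are equal and the routine reports a zero determinant, which corresponds to the exclusion of repeated $\lambda$’s in the argument above.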


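Example 2.4.2 The geometric distribution. Application of (2.4.3)
with $x = 0$, $x = 1$, etc., leads to the determinant

$$(1 - \lambda_1)(1 - \lambda_2) \cdots (1 - \lambda_k) \begin{vmatrix} 1 & 1 & \cdots & 1 \\ \lambda_1 & \lambda_2 & \cdots & \lambda_k \\ \vdots & \vdots & & \vdots \\ \lambda_1^{k-1} & \lambda_2^{k-1} & \cdots & \lambda_k^{k-1} \end{vmatrix}$$

which is $\ne 0$ for distinct $\lambda_1, \lambda_2, \ldots, \lambda_k$, by the argument above.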


Example 2.4.3 The normal distribution, $N(\lambda, 1)$. Application of
(2.4.3) to this case is awkward, but identifiability of $G$ can be
established in various ways. For example, putting

$$\mu'_r = \int x^r f_G(x)\,dx,$$

we have

$$\int \lambda\,dG(\lambda) = \mu'_1$$

$$\int \lambda^2\,dG(\lambda) = \mu'_2 - 1$$

$$\int \lambda^3\,dG(\lambda) = \mu'_3 - 3\mu'_1$$

etc. Thus we have linear equations in $\theta_1, \theta_2, \ldots, \theta_k$ of the form

$$\sum_{i=1}^{k} \theta_i \lambda_i^r = \int \lambda^r\,dG(\lambda), \qquad r = 1, 2, \ldots, k,$$

the coefficient matrix of which has a determinant of the same form as
in Examples 2.4.1 and 2.4.2.
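
The moment relations used in Example 2.4.3 can be verified numerically for a finite normal mixture. The sketch below is a check only: the mixture components are made-up values, the raw moments $\mu'_r$ are computed by a midpoint rule over an arbitrarily chosen interval, and the function names are hypothetical.

```python
import math


def mixture_density(x, lams, thetas):
    """f_G(x) for a mixture of N(lam_j, 1) densities with weights theta_j."""
    total = sum(t * math.exp(-(x - lam) ** 2 / 2.0) for lam, t in zip(lams, thetas))
    return total / math.sqrt(2.0 * math.pi)


def raw_moment(r, lams, thetas, half_width=14.0, steps=28000):
    """mu'_r = integral of x^r f_G(x) dx by the midpoint rule."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * h
        total += x ** r * mixture_density(x, lams, thetas) * h
    return total


def prior_moments_from_mixture(lams, thetas):
    """First three moments of G recovered from mu'_1, mu'_2, mu'_3."""
    m1, m2, m3 = (raw_moment(r, lams, thetas) for r in (1, 2, 3))
    return m1, m2 - 1.0, m3 - 3.0 * m1


lams, thetas = [-1.0, 0.5, 2.0], [0.2, 0.5, 0.3]
direct = tuple(sum(t * lam ** r for lam, t in zip(lams, thetas)) for r in (1, 2, 3))
print(prior_moments_from_mixture(lams, thetas), direct)
```

With the $\lambda_j$ known, the recovered moments are the right-hand sides of the linear equations in $\theta_1, \ldots, \theta_k$.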


Example 2.4.4 Translation parameter mixtures. Let $F(x \mid \lambda)$ be a
member of a translation parameter family indexed by a translation
parameter $\lambda$; i.e. $F(x \mid \lambda) = F(x - \lambda)$ where $F(\cdot)$ is of a known form.
Application of (2.4.4) gives the result that $H$ is identifiable if

$$\sum_{j=1}^{k} \alpha_j \psi(t, \lambda_j) = 0$$

implies $\alpha_j = 0$, $j = 1, \ldots, k$, for some $t$ in the interval $(-\delta, \delta)$ with
$\delta > 0$. Now $\psi(t, \lambda)$ satisfies the relation

$$\psi(t, \lambda) = e^{it\lambda}\, \psi(t, 0).$$

Since $\psi(t, 0)$ is continuous and $\psi(0, 0) = 1$, there exists a region
$(-\delta, \delta)$ with $\delta > 0$ such that $|\psi(t, 0)| > 0$ for $t \in (-\delta, \delta)$. Hence $H$ is
identifiable if for $t \in (-\delta, \delta)$,

$$\sum_{j=1}^{k} \alpha_j \exp(it\lambda_j) = 0$$

implies $\alpha_j = 0$ ($j = 1, \ldots, k$). This has been shown to be the case by
Yakowitz and Spragins (1968). Thus the finite mixtures of translation
parameter families are identifiable.


Example 2.4.5 Scale parameter mixtures. Let $F(x \mid \lambda)$ be a member
of the scale parameter family indexed by a scale parameter $\lambda$, i.e.
$F(x \mid \lambda) = F(x/\lambda)$, where $F(\cdot)$ is of a known form. Let $\lambda = \exp(\theta)$,
$X = \exp(Y)$. Then the distribution of $Y$ is

$$\Pr\{Y \le y\} = \Pr\{X \le e^{y}\} = F\{\exp(y - \theta)\}$$

which is a location parameter distribution. Hence finite mixtures of
the distribution of $Y$ are identifiable. The relationship between $X$
and $Y$ requires that finite mixtures of the distribution of $X$ are
identifiable provided $F(\cdot)$ possesses an $r$th moment for some $r > 0$
(Behboodian, 1975).


2.4.2 Finite mixtures with $\theta_1, \ldots, \theta_k$ given

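Our main application of mixtures of this type is to problems where
an unknown $G$ is approximated by a step-function $G_k$. In this case
assuming $\theta_j = 1/k$ is convenient and represents no great loss in
generality. A useful result is obtained by considering kernel distributions
such that

$$\int x^r\,dF(x \mid \lambda) = \sum_{j=0}^{r} a_{jr}\lambda^j, \tag{2.4.6}$$

a polynomial of degree $r$ in $\lambda$. Let

$$\mu'_r = \int x^r\,dF_G(x).$$

Now the equation

$$\sum_{j=1}^{k} \theta_j F(x \mid \lambda_j) = F_G(x)$$

gives

$$\sum_{j=1}^{k} \theta_j \sum_{t=0}^{r} a_{tr}\lambda_j^t = \mu'_r, \qquad r = 1, 2, \ldots.$$

Thus we have in general

$$\sum_{j=1}^{k} \theta_j \lambda_j^r = \alpha_r,$$

where the $\alpha_r$'s depend on the $a_{jr}$ and $\mu'_r$. Since $\theta_j = 1/k$ we obtain,
after absorbing the constant factor $k$ into the $\alpha_r$, the equations for $\lambda_j$ as

$$
\begin{aligned}
\lambda_1 + \lambda_2 + \cdots + \lambda_k &= \alpha_1 \\
\lambda_1^2 + \lambda_2^2 + \cdots + \lambda_k^2 &= \alpha_2 \\
&\;\;\vdots \\
\lambda_1^k + \lambda_2^k + \cdots + \lambda_k^k &= \alpha_k
\end{aligned}
\tag{2.4.7}
$$

which have $k!$ solutions. But, owing to the symmetry of the equations
w.r.t. $\lambda_1, \lambda_2, \ldots, \lambda_k$, the solution is unique if we impose the restriction
$\lambda_1 \le \lambda_2 \le \cdots \le \lambda_k$.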
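One way to solve equations of the form (2.4.7) numerically is through Newton's identities, which convert the power sums into the elementary symmetric functions of the $\lambda_j$; the $\lambda_j$ are then the roots of a polynomial. The sketch below is an illustration rather than a method taken from the text: the function name is hypothetical, and taking the real part of the roots assumes that the supplied power sums really do come from $k$ real values.

```python
import numpy as np

def lambdas_from_power_sums(power_sums):
    """Recover lam_1 <= ... <= lam_k from p_r = sum_j lam_j**r, r = 1, ..., k."""
    k = len(power_sums)
    p = [None] + list(power_sums)      # 1-based indexing for the power sums
    e = [1.0] + [0.0] * k              # elementary symmetric functions e_0..e_k
    for r in range(1, k + 1):
        # Newton's identity: r * e_r = sum_{i=1}^{r} (-1)**(i-1) * e_{r-i} * p_i
        e[r] = sum((-1) ** (i - 1) * e[r - i] * p[i] for i in range(1, r + 1)) / r
    # prod_j (x - lam_j) = sum_r (-1)**r * e_r * x**(k - r)
    coeffs = [(-1) ** r * e[r] for r in range(k + 1)]
    return np.sort(np.roots(coeffs).real)

true_lams = np.array([0.5, 1.0, 2.5])
sums = [np.sum(true_lams ** r) for r in (1, 2, 3)]
print(lambdas_from_power_sums(sums))
```

Sorting the roots imposes the ordering restriction, which selects one of the $k!$ permutations.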


Example 2.4.6 The Poisson kernel, $p(x \mid \lambda) = \lambda^x \exp(-\lambda)/x!$. The $r$th
factorial moment of $X$ is

$$\sum_{x} x(x - 1)(x - 2) \cdots (x - r + 1)\,p(x \mid \lambda) = \lambda^r,$$

so that the $r$th moment is a polynomial of degree $r$ in $\lambda$.

Example 2.4.7 The normal kernel, $N(\lambda, 1)$. The central moments do
not depend on $\lambda$, hence the moments about the origin are polynomials
in $\lambda$ of the required order.

Example 2.4.8 $F(x \mid \lambda) = F\{(x - \lambda)/\sigma\}$. When the scale parameter, $\sigma$,
is known, the argument of Example 2.4.7 applies here.

2.4.3 Finite mixtures, continuous G 

As we remarked above, finite mixing distributions occur naturally in 
problems involving a finite number of simple hypotheses. In problems 
of testing composite hypotheses, continuous approximating d.f.s may 
be regarded as more suitable than discrete ones. A continuous finite 
approximation which has been used in such problems (Maritz, 1968), 
is depicted in Fig. 2.1, along with a corresponding step-function 
approximation. 



Fig. 2.1 






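This finite d.f. has the p.d.f.

$$dG^*(\lambda) = \frac{\theta_j\,d\lambda}{\lambda_{j+1} - \lambda_j}, \qquad \text{for } \lambda_j < \lambda \le \lambda_{j+1}, \quad j = 1, \ldots, k - 1. \tag{2.4.8}$$

As before, we shall assume that either the $\theta$'s or the $\lambda$'s are given.

When the $\lambda$'s are fixed, a condition for identifiability, similar to
(2.4.3), can be written down. Putting

$$Q_j(x) = \int_{\lambda_j}^{\lambda_{j+1}} F(x \mid \lambda)\,d\lambda, \qquad j = 1, 2, \ldots, k - 1, \tag{2.4.9}$$

the condition is that $k - 1$ distinct values, $x_1, x_2, \ldots, x_{k-1}$, of $x$ should
exist such that the determinant $|Q_j(x_i)| \neq 0$.

This condition is rather cumbersome to apply in general, and again
we consider the restricted class of distributions of equation (2.4.6). For
such distributions we have

$$
\mu'_r = \int x^r\,dF_G(x) = \int (A_0 + A_1\lambda + \cdots + A_r\lambda^r)\,dG^*(\lambda)
= \sum_{j=1}^{k-1} \frac{\theta_j}{\lambda_{j+1} - \lambda_j} \sum_{t=0}^{r} A_t\,\frac{\lambda_{j+1}^{t+1} - \lambda_j^{t+1}}{t + 1}, \tag{2.4.10}
$$

leading to conditions like those in Examples 2.4.1 and 2.4.2.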


2.5 General mixtures: identifiability 

The general mixture of a distribution function $F(x \mid \lambda)$ or a p.d. $p(x \mid \lambda)$
is given by (2.1.1) or (2.1.2), where $G(\lambda)$ is a distribution function whose
support is an interval or a countably infinite set of values. Necessary
and sufficient conditions analogous to (2.4.3) exist in these more
general cases (see Tallis, 1969; Tallis and Chesson, 1982) under certain
conditions such as continuity and square integrability of $\partial F(x \mid \lambda)/\partial\lambda$
in the case of continuous $G$. Application of such conditions is not easy
in general. For some special classes of general mixtures, more special
but easier techniques are available as shown below.


2.5.1 Scale and location parameter families 

Sections 2.3 and 2.4 have dealt mainly with identifiability when the
family of mixing distributions is restricted in some way, although
equation (2.4.6) defined a limited class of distributions $F(x \mid \lambda)$. We
shall now give some attention to special types of kernel distributions.
In this section they are of the form $K((x - \lambda)/\sigma)$, where $\lambda$ and $\sigma$ are,
respectively, the location and scale parameters. Since we are mainly
concerned with mixtures on one parameter, we shall assume that $\sigma$ is a
known constant; mixtures with $\sigma$ variable and $\lambda$ constant are also of
general interest, but in EB work they are less common.


Example 2.5.1 The normal distribution, $N(\lambda, \sigma^2)$. Let $\sigma = 1$ without
loss of generality, and consider the mixture

$$f_G(x) = \int f(x - \lambda)\,dG(\lambda),$$

where $f(x - \lambda) = (1/\sqrt{2\pi}) \exp\{-\tfrac{1}{2}(x - \lambda)^2\}$. The characteristic
function of the mixture is

$$\int e^{ixt} f_G(x)\,dx = \iint e^{ixt} f(x - \lambda)\,dx\,dG(\lambda) = e^{-t^2/2} \int e^{i\lambda t}\,dG(\lambda),$$

or

$$e^{t^2/2} \int e^{ixt} f_G(x)\,dx = \int e^{i\lambda t}\,dG(\lambda). \tag{2.5.1}$$

Equation (2.5.1) shows that the l.h.s. function, which is uniquely
determined by $f_G(x)$, is the characteristic function of $G(\lambda)$. Hence, by
the inversion theorem for characteristic functions (c.f.s), $G(\lambda)$ is
determined. Thus $G$ is identifiable in all mixtures of normal distributions
when the scale parameter is held constant. The identifiability
of finite mixtures of normal distributions, noted before, is included in
this result.
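Relation (2.5.1) can be checked numerically for a discrete $G$. The sketch below is illustrative only: the function names, the integration grid and the choice of a three-point $G$ are all assumptions made for the example, and the integral is approximated by a simple Riemann sum.

```python
import numpy as np

def mixture_density(x, lams, thetas):
    """f_G(x) = sum_j theta_j * N(x; lam_j, 1) for a discrete mixing d.f. G."""
    x = np.asarray(x, dtype=float)[:, None]
    comp = np.exp(-0.5 * (x - lams[None, :]) ** 2) / np.sqrt(2.0 * np.pi)
    return comp @ thetas

def lhs_of_2_5_1(t, lams, thetas, half_width=12.0, n=24001):
    """exp(t^2 / 2) * integral of exp(i x t) f_G(x) dx, by a Riemann sum."""
    x = np.linspace(lams.min() - half_width, lams.max() + half_width, n)
    dx = x[1] - x[0]
    integral = np.sum(np.exp(1j * x * t) * mixture_density(x, lams, thetas)) * dx
    return np.exp(0.5 * t ** 2) * integral

def cf_of_G(t, lams, thetas):
    """Characteristic function of the discrete G: sum_j theta_j exp(i lam_j t)."""
    return np.sum(thetas * np.exp(1j * lams * t))

lams = np.array([-1.0, 0.5, 2.0])
thetas = np.array([0.2, 0.5, 0.3])
for t in (0.0, 0.7, 1.5):
    print(t, lhs_of_2_5_1(t, lams, thetas), cf_of_G(t, lams, thetas))
```

The factor $e^{t^2/2}$ grows quickly with $|t|$, so in practice any error in the estimated mixture c.f. is strongly amplified at large $|t|$.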

More generally, let us consider mixtures

$$H(x) = \int K(x - \lambda)\,dG(\lambda), \tag{2.5.2}$$

where we have put $\sigma = \text{constant} = 1$. A d.f. $H$ defined by the r.h.s. of
(2.5.2) is called the convolution of $K$ and $G$, and is written $H = K * G$. It
is well known that the c.f., $\psi_H(t)$, of $H$ is

$$\psi_H(t) = \psi_K(t)\,\psi_G(t), \tag{2.5.3}$$

where $\psi_K$ and $\psi_G$ are the c.f.s of $K(x)$ and $G(\lambda)$ respectively (see e.g.
Lukacs, 1960).

Now suppose that two distributions $G_1$ and $G_2$ generate the same
mixture $H$. Then

$$\psi_K(t)\,\psi_{G_1}(t) = \psi_K(t)\,\psi_{G_2}(t) \tag{2.5.4}$$

so that, unless $\psi_K(t) = 0$ over a finite interval,

$$\psi_{G_1}(t) = \psi_{G_2}(t),$$

thus implying the identifiability of $G$. This general treatment covers
the cases of the following example with $\sigma = 1$.

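Example 2.5.2 The following distributions are well-known
examples of the location-scale type:

(i) the normal distribution:

$$f(x \mid \lambda, \sigma) = C \exp\left\{-\frac{(x - \lambda)^2}{2\sigma^2}\right\},$$

(ii) the 'extreme-value' distribution:

$$F(x \mid \lambda, \sigma) = \exp\{-\exp[-(x - \lambda)/\sigma]\},$$

(iii) the Pearson Type VII distribution:

$$f(x \mid \lambda, \sigma) = \frac{1}{\sigma B(\tfrac{1}{2}, m - \tfrac{1}{2})} \left[1 + \frac{(x - \lambda)^2}{\sigma^2}\right]^{-m}, \qquad \text{for fixed } m.$$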

2.5.2 Identifiability: additively closed families 

The d.f. $F(x \mid \lambda)$ is said to belong to the additively closed family of
distributions if

$$F(x \mid \lambda_1) * F(x \mid \lambda_2) = F(x \mid \lambda_1 + \lambda_2). \tag{2.5.5}$$

In other words, if the independent r.v.s $X_1$ and $X_2$ have d.f.s $F(x \mid \lambda_1)$
and $F(x \mid \lambda_2)$, then the d.f. of $X_1 + X_2$ is $F(x \mid \lambda_1 + \lambda_2)$.

Example 2.5.3 A well known and important example is the Poisson
distribution, which we consider first. The factorial moment generating
function (m.g.f.) of the mixed Poisson distribution is

$$\int \sum_{x} (1 + t)^x e^{-\lambda} \frac{\lambda^x}{x!}\,dG(\lambda) = \int e^{\lambda t}\,dG(\lambda).$$

Hence $G(\lambda)$ is identifiable if the factorial m.g.f. of the mixture exists.


For a more general result we follow Teicher (1961) and again make
use of characteristic functions. Denoting the c.f.s of $F(x \mid \lambda_1)$, $F(x \mid \lambda_2)$
by $\psi(t, \lambda_1)$ and $\psi(t, \lambda_2)$, condition (2.5.5) leads to

$$\psi(t, \lambda_1)\,\psi(t, \lambda_2) = \psi(t, \lambda_1 + \lambda_2),$$

and the only solution of this functional equation has the form

$$\psi(t, \lambda) = e^{\lambda C(t)}.$$

Thus

$$\psi(t, 1) = e^{C(t)},$$

or

$$\psi(t, \lambda) = [\psi(t, 1)]^{\lambda}. \tag{2.5.6}$$

In this argument $\lambda \ge 0$; negative values are excluded, for with negative
values a suitable choice of $\lambda_1$ and $\lambda_2$ would yield

$$F(x \mid \lambda_1) * F(x \mid \lambda_2) = \text{a degenerate distribution},$$

which is impossible unless $F(x \mid \lambda)$ is itself degenerate.

Using (2.5.6) we see that the c.f. of the mixture $H(x)$ can be expressed
as

$$\psi_H(t) = \int_{\lambda \ge 0} [\psi(t, 1)]^{\lambda}\,dG(\lambda),$$

where

$$\int_{\lambda \ge 0} dG(\lambda) = 1.$$

Now the transform

$$\psi(z; G) = \int z^{\lambda}\,dG(\lambda) \tag{2.5.7}$$

is analytic, at least in the annulus $0 < |z| < 1$. If the d.f.s $G_1$ and $G_2$
yielded the same mixture $H$, then $\psi(z; G_1)$ and $\psi(z; G_2)$ would
coincide for $z = \psi(t, 1)$, and consequently throughout $0 < |z| < 1$. This
would entail

$$\psi(\rho e^{it}; G_1) = \psi(\rho e^{it}; G_2),$$

for $\rho < 1$, and hence, by the dominated convergence theorem, for
$\rho = 1$. Thus

$$\int_{\lambda \ge 0} e^{it\lambda}\,dG_1(\lambda) = \int_{\lambda \ge 0} e^{it\lambda}\,dG_2(\lambda) \tag{2.5.8}$$

implying $G_1(\lambda) = G_2(\lambda)$, and the identifiability of $G$.

Example 2.5.4 Poisson kernel. Identifiability also follows from the 
preceding argument. 

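Example 2.5.5 Mixtures of $\Gamma$-distributions. Let

$$dF(x \mid \lambda, \gamma) = \frac{\gamma^{\lambda} x^{\lambda - 1} e^{-\gamma x}}{\Gamma(\lambda)}\,dx. \tag{2.5.9}$$

The c.f. of this distribution is

$$\psi(t; \lambda, \gamma) = [1 - (it/\gamma)]^{-\lambda}.$$

Thus

$$\psi(t; \lambda_1, \gamma)\,\psi(t; \lambda_2, \gamma) = \psi(t; \lambda_1 + \lambda_2, \gamma),$$

and the family of distributions (2.5.9), with $\gamma$ fixed, is additively closed.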


2.6 Identifiability of multiparameter mixtures 

In a general formulation of the problem of finite mixtures Teicher
(1963) allows the distribution of $X$ to depend on more than one
parameter. The case of mixtures of one-dimensional d.f.s $F(x \mid \boldsymbol{\lambda})$,
where $\boldsymbol{\lambda}$ is a $p$-vector, was considered. For a fixed set of points
$\boldsymbol{\lambda}_1, \boldsymbol{\lambda}_2, \ldots, \boldsymbol{\lambda}_k$ with which are associated probabilities $\theta_1, \theta_2, \ldots, \theta_k$, the
criterion (2.4.3) with $\boldsymbol{\lambda}_j$ in place of $\lambda_j$ still holds for identifiability of the
discrete $p$-dimensional mixing distribution. The case when $x$ is
replaced by a $p$-vector has been studied by Yakowitz and Spragins
(1968). Tallis (1969) and Tallis and Chesson (1982) also considered the
vector variate-vector parameter case in a more general framework
covering countably infinite mixtures and continuous mixing distributions
$G$. The necessary and sufficient conditions for these more
general cases are given; however, with greater generality they become
more difficult to apply.

In EB applications we often have two-dimensional distributions
of the r.v.s $X_1$ and $X_2$ depending on two parameters $\alpha$ and $\beta$, usually
such that $E(X_1 \mid \alpha, \beta) = \beta$, and $E(X_2 \mid \alpha, \beta) = \alpha$. Typical cases are $X_1$,
$X_2$ being estimates of the slope and intercept in a linear regression,
and $X_1$, $X_2$ being the sample mean and variance of observations
from a normal population. Some examples of identifiable mixtures
follow.


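Example 2.6.1 The joint distribution of $X_1$ and $X_2$ is normal with
mean vector $(\alpha, \beta)$ and known covariance matrix $I$. Then the joint
mixed c.f. of $X_{1G}$ and $X_{2G}$ is

$$
\int e^{i\mathbf{t}'\mathbf{x}}\,dF_G(\mathbf{x})
= \iint e^{i\mathbf{t}'\mathbf{x}}\,dF(\mathbf{x} \mid \alpha, \beta)\,dG(\alpha, \beta)
= e^{-\mathbf{t}'\mathbf{t}/2} \int e^{i(t_1\alpha + t_2\beta)}\,dG(\alpha, \beta) \tag{2.6.1}
$$

where $G(\alpha, \beta)$ is the two-dimensional prior d.f. By the arguments of
section 2.5, $G$ is identifiable.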


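Example 2.6.2 Suppose that $X_1$ and $X_2$ are independent for given
$(\alpha, \beta)$ such that

$$F(x_1, x_2 \mid \alpha, \beta) = F_1(x_1 \mid \beta)\,F_2(x_2 \mid \alpha, \beta). \tag{2.6.2}$$

This form holds, for example, when $X_2$ is the mean and $X_1$ is the
variance in sampling from a normal population. Then the marginal
mixed $X_1$-distribution has the p.d.f.

$$
\begin{aligned}
f_G(x_1) &= \int_{x_2} f_G(x_1, x_2)\,dx_2 \\
&= \iint f_1(x_1 \mid \beta)\,f_2(x_2 \mid \alpha, \beta)\,dx_2\,dG(\alpha, \beta) \\
&= \int_{\alpha}\int_{\beta} f_1(x_1 \mid \beta)\,dG(\alpha, \beta) = \int_{\beta} f_1(x_1 \mid \beta)\,dG_1(\beta).
\end{aligned}
$$

Now if $G(\alpha, \beta)$ is a finite distribution with masses $\theta_j$ concentrated at
the known points $(\alpha_j, \beta_j)$, $j = 1, 2, \ldots, k$, then the $\theta_j$ are determined by
the marginal $X_1$-distribution if the one-dimensional $G_1(\beta)$ is identifiable.
If no two $\alpha$'s have the same associated $\beta$, $G(\alpha, \beta)$ is thus identified.
When the $\beta$'s for different $\alpha$'s are not necessarily distinct, it is still
possible to establish identifiability. For example, if we have four
masses $\theta_{ij}$, $i, j = 1, 2$, concentrated at the points $(\alpha_i, \beta_j)$, then
$\theta_{\cdot 1} = \theta_{11} + \theta_{21}$ and $\theta_{\cdot 2} = \theta_{12} + \theta_{22}$ are determined by the marginal
$X_1$-distribution.

Now,

$$
\begin{aligned}
\mu'_{01} &= \iint x_2\,dF_G(x_1, x_2) \\
&= \int_{x_1}\int_{x_2}\int_{\alpha}\int_{\beta} x_2\,dF_1(x_1 \mid \beta)\,dF_2(x_2 \mid \alpha, \beta)\,dG(\alpha, \beta) \\
&= (\alpha_1 - \alpha_2)\theta_{11} + (\alpha_1 - \alpha_2)\theta_{12} + \alpha_2(\theta_{\cdot 1} + \theta_{\cdot 2}),
\end{aligned}
$$

and similarly,

$$\mu'_{11} = (\alpha_1 - \alpha_2)\beta_1\theta_{11} + (\alpha_1 - \alpha_2)\beta_2\theta_{12} + \alpha_2\beta_1\theta_{\cdot 1} + \alpha_2\beta_2\theta_{\cdot 2}.$$

Hence we have two linear equations

$$
\begin{aligned}
\theta_{11} + \theta_{12} &= [\mu'_{01} - \alpha_2(\theta_{\cdot 1} + \theta_{\cdot 2})]/(\alpha_1 - \alpha_2) \\
\beta_1\theta_{11} + \beta_2\theta_{12} &= [\mu'_{11} - \alpha_2\beta_1\theta_{\cdot 1} - \alpha_2\beta_2\theta_{\cdot 2}]/(\alpha_1 - \alpha_2),
\end{aligned}
$$

from which we can find $\theta_{11}$ and $\theta_{12}$ uniquely, provided $\alpha_1 \neq \alpha_2$ and $\beta_1 \neq \beta_2$.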


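Example 2.6.3 Let the prior distribution have four equal masses at
the points

$$(\beta_1, \alpha_{11}), \quad (\beta_1, \alpha_{12}), \quad (\beta_2, \alpha_{21}), \quad (\beta_2, \alpha_{22})$$

and let $F(x_1, x_2 \mid \alpha, \beta)$ be of the form (2.6.2). Then, by the same
reasoning as above, $\beta_1$ and $\beta_2$ are determined by the marginal $X_1$-distribution.
Now suppose that $F_2(x_2 \mid \alpha, \beta)$ is such that

$$\int x_2^r\,dF_2(x_2 \mid \alpha, \beta) = B_{0r}(\beta) + B_{1r}(\beta)\alpha + \cdots + B_{rr}(\beta)\alpha^r,$$

i.e. a polynomial of degree $r$ in $\alpha$. Then putting $\alpha_{1\cdot} = \alpha_{11} + \alpha_{12}$,
$\alpha_{2\cdot} = \alpha_{21} + \alpha_{22}$, we have

$$
\begin{aligned}
B_{11}(\beta_1)\alpha_{1\cdot} + B_{11}(\beta_2)\alpha_{2\cdot} &= 4\mu'_{01} - 2[B_{01}(\beta_1) + B_{01}(\beta_2)] \\
\beta_1 B_{11}(\beta_1)\alpha_{1\cdot} + \beta_2 B_{11}(\beta_2)\alpha_{2\cdot} &= 4\mu'_{11} - 2[\beta_1 B_{01}(\beta_1) + \beta_2 B_{01}(\beta_2)],
\end{aligned}
$$

and these two equations yield solutions for $\alpha_{1\cdot}$ and $\alpha_{2\cdot}$ when $\beta_1 \neq \beta_2$.
Similarly, we can solve for $(\alpha_{11}^2 + \alpha_{12}^2)$ and $(\alpha_{21}^2 + \alpha_{22}^2)$ from

$$
\begin{aligned}
B_{22}(\beta_1)(\alpha_{11}^2 + \alpha_{12}^2) + B_{22}(\beta_2)(\alpha_{21}^2 + \alpha_{22}^2) &= D_1 \\
\beta_1 B_{22}(\beta_1)(\alpha_{11}^2 + \alpha_{12}^2) + \beta_2 B_{22}(\beta_2)(\alpha_{21}^2 + \alpha_{22}^2) &= D_2,
\end{aligned}
$$

where

$$
\begin{aligned}
D_1 &= 4\mu'_{02} - 2[B_{02}(\beta_1) + B_{02}(\beta_2)] - B_{12}(\beta_1)\alpha_{1\cdot} - B_{12}(\beta_2)\alpha_{2\cdot}, \\
D_2 &= 4\mu'_{12} - 2[\beta_1 B_{02}(\beta_1) + \beta_2 B_{02}(\beta_2)] - \beta_1 B_{12}(\beta_1)\alpha_{1\cdot} - \beta_2 B_{12}(\beta_2)\alpha_{2\cdot}.
\end{aligned}
$$

Thus we have two sets of equations

$$
\begin{aligned}
\alpha_{11} + \alpha_{12} &= C_{11}, & \alpha_{21} + \alpha_{22} &= C_{21}, \\
\alpha_{11}^2 + \alpha_{12}^2 &= C_{12}, & \alpha_{21}^2 + \alpha_{22}^2 &= C_{22},
\end{aligned}
$$

which, owing to their symmetry, give unique solutions for the $\alpha$'s.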


Examples 2.6.2 and 2.6.3 can be generalized in an obvious way to
$k = r \times s$ mass points. Another useful extension is obtained if we relax
the requirement that $X_1$ and $X_2$ are independent for given $\alpha$ and $\beta$, as
expressed by equation (2.6.2). If the marginal distribution of $X_1$, for
given $\alpha$ and $\beta$, depends only on $\beta$ and not also on $\alpha$, then we again have

$$f_G(x_1) = \int_{\beta} f_1(x_1 \mid \beta)\,dG_1(\beta).$$

Consequently, in a finite mixture of the type considered in
Example 2.6.3, the $\beta$'s are determined by the marginal $X_1$-distribution
if $G_1(\beta)$ is identifiable. Now if

$$\int_{x_1}\int_{x_2} x_2^r\,dF(x_1, x_2 \mid \alpha, \beta) = B_{0r}(\beta) + B_{1r}(\beta)\alpha + \cdots + B_{rr}(\beta)\alpha^r,$$

then the $\alpha$'s are determined by steps similar to those of Example 2.6.3.


2.7 Determination of $G(\lambda)$

Our survey of results relating to the question of identifiability of $G$ has
shown that, in theory, $G$ can be determined when either $G$ or $F$ is
known to belong to certain classes of distributions. We observe that
these classes contain most of the distributions occurring commonly in
applications. However, the actual determination of $G(\lambda)$ remains a
non-trivial task, and the difficulties are increased when $H$ is known
only approximately.

Let us examine the class of translation parameter distributions
considered in section 2.4. We found that

$$\psi_H(t) = \psi_K(t)\,\psi_G(t), \tag{2.7.1}$$

every $\psi$ being a characteristic function. Thus

$$\psi_G(t) = \psi_H(t)/\psi_K(t)$$

is a c.f., and by implementing the inversion theorem for c.f.s $G$ can be
determined. In EB problems such a direct procedure is not possible.
The d.f. $H$ is approximated by the empirical d.f. $F_n(x)$, a step-function,
and it may happen that $\psi_{F_n}(t)/\psi_K(t)$ is not a c.f. Indeed, it is obvious
that in those cases where $F(x \mid \lambda)$ is continuous at all $x$ for every $\lambda$, any
mixture of $F(x \mid \lambda)$ is also continuous, so that no $G$ exists such that

$$F_n(x) = \int F(x \mid \lambda)\,dG(\lambda) \tag{2.7.2}$$

is satisfied exactly. We shall return to this question in section 2.7.2
and consider a method of constructing an approximate solution to
the basic equation (2.1.1) which can also be applied to (2.7.2).

When $G$ is restricted to certain narrow classes of distributions as in
section 2.3, it is usually possible to determine $G$ rather easily.
Examples 2.3.1, 2.3.2 and 2.3.3 illustrate how this can be done. In the
case of finite mixtures with $\lambda_j$ given, $\theta_j$ can be found by solving suitable
linear equations. Alternatively, with $\theta_j$ given, solutions to equations
like (2.4.7) may be sought. In either case, lack of precise knowledge of
$H$ will mean that the r.h.s. of the equations are not known exactly, and
that exact solutions do not exist. We may then consider minimizing
the differences between the l.h.s. and the r.h.s. by suitable choice of the
unknown ($\lambda_j$ or $\theta_j$). In such a procedure, some criterion, for example
weighted least squares, would have to be used in the actual
minimizing process. Of course, when the r.h.s. is known exactly, the
process should yield a 'residual sum of squares' of zero. The methods
which are discussed in section 2.7.2 are logically analogous to the
above proposals, are motivated by the method of maximum likelihood,
and may be regarded as one way of overcoming the problem
of choosing the weights for the least squares procedure.


2.7.1 A measure of the distance between two distributions 

Consider two sequences of probabilities $p_j$ and $z_j$ ($j = 1, \ldots, k$), such
that

$$\sum_{j=1}^{k} p_j = \sum_{j=1}^{k} z_j = 1.$$

Let the $p_j$ be fixed and let

$$I(p, z) = \sum_{j=1}^{k} p_j \log(p_j/z_j) \tag{2.7.3}$$

be a function of $z_1, \ldots, z_k$. Then $I(p, z) > 0$, except when $z_j = p_j$,
$j = 1, \ldots, k$, in which case $I(p, z) = 0$. To prove this we put $r_j = p_j/z_j$
so that

$$I(p, z) = \sum_{j=1}^{k} (r_j \log r_j)\,z_j.$$

Now the function $\phi(t) = t \log t$ has $\phi(1) = 0$, and

$$\phi'(t) = 1 + \log t, \qquad \phi''(t) = 1/t,$$

so that

$$\phi(t) = \phi(1) + (t - 1)\phi'(1) + \tfrac{1}{2}(t - 1)^2\phi''(u) = (t - 1) + \tfrac{1}{2}(t - 1)^2\phi''(u),$$

where $u$ lies between $t$ and 1. Hence, since $\sum_j (r_j - 1)z_j = \sum_j p_j - \sum_j z_j = 0$,

$$
I(p, z) = \sum_{j=1}^{k} (r_j - 1)z_j + \tfrac{1}{2}\sum_{j=1}^{k} (r_j - 1)^2\phi''(u_j)\,z_j
= \tfrac{1}{2}\sum_{j=1}^{k} (r_j - 1)^2\phi''(u_j)\,z_j, \tag{2.7.4}
$$

where $u_j$ lies between $r_j$ and 1, so that $u_j > 0$. We note that $\phi''(t) =
1/t > 0$ for $t > 0$, and therefore $I(p, z) \ge 0$.
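The quantity (2.7.3) is simple to compute. The following sketch is only an illustration with a hypothetical function name; it assumes every $z_j > 0$ and uses the usual convention that terms with $p_j = 0$ contribute zero.

```python
import math

def kl_distance(p, z):
    """I(p, z) = sum_j p_j * log(p_j / z_j), skipping terms with p_j = 0.

    Assumes p and z are probability vectors of equal length with z_j > 0.
    """
    return sum(pj * math.log(pj / zj) for pj, zj in zip(p, z) if pj > 0)

p = [0.5, 0.3, 0.2]
z = [0.2, 0.3, 0.5]
print(kl_distance(p, z))   # strictly positive because z differs from p
print(kl_distance(p, p))   # zero when the two distributions coincide
```

Note that $I(p, z)$ is not symmetric in its arguments in general, which is why the text writes 'distance' in quotation marks.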

The non-negative $I(p, z)$ may be regarded as a measure of 'distance'
between the two probability distributions $p_j$ and $z_j$, $j = 1, \ldots, k$
(Kullback, 1959). Its definition can be extended to the more general
discrete case as $k \to \infty$, and to continuous distributions. In the latter
case, let $h(x)$ and $w(x)$ denote two p.d.f.s, then

$$
\begin{aligned}
I(h, w) &= \int \log\{h(x)/w(x)\}\,dH(x) \\
&= \int \log\{h(x)\}\,dH(x) - \int \log\{w(x)\}\,dH(x) \\
&\ge 0.
\end{aligned}
\tag{2.7.5}
$$

We observe that minimizing $I(h, w)$ for fixed $h(x)$, by suitable choice of
$w(x)$, is equivalent to maximizing

$$L(h, w) = \int \log\{w(x)\}\,dH(x).$$

It is of interest, and relevant, to note a connection with the method
of maximum likelihood. Suppose a sample $x_1, \ldots, x_n$ is drawn from a
population with d.f. $H$ and that the empirical d.f. $H_n$ is constructed. If
we believe that $H$ is actually a member of a certain parametric family
of d.f.s, say the normal family with p.d.f.

$$w(x; \mu, \sigma^2) = (1/\sigma\sqrt{2\pi}) \exp\{-(x - \mu)^2/2\sigma^2\},$$

we can estimate $\mu$ and $\sigma^2$ by selecting them in such a way as to
maximize $L(h, w)$, with $H$ replaced by $H_n$. This leads to maximization
of

$$\frac{1}{n}\sum_{j=1}^{n} \log w(x_j; \mu, \sigma^2)$$

by varying $\mu$ and $\sigma^2$, and corresponds to the usual maximum
likelihood estimation.
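The correspondence with maximum likelihood can be seen in a small numerical sketch. The function names and the data below are illustrative assumptions; the closed-form maximizers used are the familiar normal-theory ones, the sample mean and the variance with divisor $n$.

```python
import math

def L_n(xs, mu, sigma2):
    """(1/n) * sum_j log w(x_j; mu, sigma2) for the normal p.d.f. w."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * sigma2) - (x - mu) ** 2 / (2.0 * sigma2)
        for x in xs
    ) / len(xs)

def normal_mle(xs):
    """Closed-form maximizers of L_n: sample mean and variance with divisor n."""
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n
    return mu, sigma2

xs = [1.2, 0.4, 2.7, 1.9, 0.8, 1.5]          # illustrative data only
mu_hat, s2_hat = normal_mle(xs)
best = L_n(xs, mu_hat, s2_hat)
# Perturbing either parameter away from the maximizer lowers L_n.
print(best, L_n(xs, mu_hat + 0.1, s2_hat), L_n(xs, mu_hat, 1.2 * s2_hat))
```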

2.7.2 Determination of mixtures by minimizing $I(\cdot, \cdot)$

Let $H(x)$ be the mixed distribution defined by (2.1.1) and put

$$F_{G^*}(x) = \int F(x \mid \lambda)\,dG^*(\lambda). \tag{2.7.6}$$

Then $H = F_G$. Now suppose we seek $G^*(\lambda)$ such that $I(H, F_{G^*})$ is
minimized. We know that $I(H, F_G) = 0$, thus the minimization will
yield $G^*_{\min}$ such that

$$\int F(x \mid \lambda)\,dG^*_{\min}(\lambda) = H(x),$$

and if $G$ is identifiable, $G^*_{\min} = G$.

The above procedure may seem rather roundabout, but it has the
obvious advantage that it can be applied when $H(x)$ is not known
exactly, but is estimated by an empirical d.f., as happens in EB
applications. The result is that a maximum likelihood estimate of $G$ is
obtained. Another useful aspect of this approach is that it affords a
way of constructing a solution to (2.1.1) by successive approximation.
Motivation of this approach may be found in the work of Teicher
(1963) and Robbins (1964).

First we observe that a sequence $\{G_k^{**}\}$ of step-functions with
jumps of size $1/k$ at $\lambda_1^{**} < \cdots < \lambda_k^{**}$ can easily be constructed such
that $G_k^{**} \to G$ (weakly) as $k \to \infty$. For example, let $G(\lambda_j^{**}) = (j - \tfrac{1}{2})/k$,
$j = 1, \ldots, k$. This is adequate when $G$ is continuous, but suitable
conventions for defining the $\lambda_j^{**}$ in other cases are easily defined. For
example, if $G(\lambda' + 0) = G(\lambda'' - 0) = (r - \tfrac{1}{2})/k$, put $\lambda_r^{**} = \tfrac{1}{2}(\lambda' + \lambda'')$. Now,
since $G_k^{**} \to G$, it follows from the Helly-Bray theorem that

$$H_k^{**}(x) = \int F(x \mid \lambda)\,dG_k^{**}(\lambda) \to H(x)$$

for all $x$, and $I(H, H_k^{**}) \to 0$. Certain restrictions on $dF_{G_k^{**}}$, $dF_G$ and $G$
are required for the truth of these statements (Maritz, 1967).

We also define the sequence $\{G_k\}$ of d.f.s with jumps of size $1/k$ at
$\lambda_1 \le \cdots \le \lambda_k$, such that, for every $k$, $I(H, H_k)$ is minimized. If $G$ is
identifiable, $G_k \to G$ (weakly). For, suppose that $G_k \not\to G$; then
$I(H, H_k) \not\to 0$. But we know by definition that

$$I(H, H_k) \le I(H, H_k^{**}) \to 0,$$

thus contradicting $I(H, H_k) \not\to 0$.

An advantage of constructing a solution by this method of
successive approximation is that it can be used when $H$ is not known
exactly, but is estimated by an empirical d.f., $H_n$. Other measures of
'distance' between two distributions can, of course, be used, but the
principle of successive approximation remains essentially the same
(cf. Deely and Kruse, 1968). We note, as in section 2.7.1, that in
practice we would maximize $L(\cdot, \cdot)$.

Another aspect of this process becomes important when we
consider the most realistic situation of nothing being known about $G$;
in particular, its identifiability may be in doubt. We can then adopt a
slightly more conservative approach, observing that $I(H, H_k) \to 0$
implies $|dH(x)/dH_k(x)| \to 1$, except possibly over a set $B_k$ such that
$\int_{B_k} dH(x) \to 0$. This means that if a reasonably small finite $k$ yields a
small $I(H, H_k)$, we shall have $|dH(x)/dH_k(x)|$ close to 1, except possibly in
the 'tails' of the $H$ distribution. Let us now suppose that, over a
$\lambda$-range, $R$, such that

$$\int_{R} dG(\lambda) > 1 - \varepsilon,$$

with $\varepsilon$ small, we can approximate $f(x \mid \lambda)$ (or $p(x \mid \lambda)$) by a
polynomial,

$$f(x \mid \lambda) = A_0(x) + A_1(x)\lambda + \cdots + A_s(x)\lambda^s.$$

Then we have

$$\sum_{j=0}^{s} A_j(x) \int \lambda^j\,dG_k(\lambda) \approx \sum_{j=0}^{s} A_j(x) \int \lambda^j\,dG(\lambda). \tag{2.7.7}$$

Thus, if a number $\ge (s + 1)$ of values of $x$ exist such that the
determinant

$$|A_j(x_i)| \neq 0, \tag{2.7.8}$$

then we have

$$\int \lambda^j\,dG_k(\lambda) \approx \int \lambda^j\,dG(\lambda), \qquad j = 1, \ldots, s. \tag{2.7.9}$$

Identifiability of $G$ only affects the determination of $G$ indirectly.
Any restrictions on $G_k$ to render it determinable are clearly permissible,
but the fulfilment of (2.7.8) will dictate the accuracy of the
approximation of $G$ by $G_k$ which can be achieved. For example, if
$p(x \mid \lambda)$ is the binomial kernel $B(x; n, \lambda)$ then there are only $(n + 1)$
distinct values of $x$, and hence $s$ can at most be $n$. This clearly places a
limitation on the accuracy of the approximation, because, as (2.7.8)
shows, we can approximate no more than the first $n$ moments of $G(\lambda)$.
We shall see, in section 3.7, that in the case of EB point estimation, the
question of identifiability of $G$ becomes unimportant, especially when
$\operatorname{var}(\lambda)$ is small.

While we have been concerned mainly with the properties of $G_k$ as
$k \to \infty$, the arguments clearly suggest that it may be possible to obtain
a satisfactory approximation to $G$ by a $G_k$ with reasonably small $k$. It
may be remarked that such a process is analogous to the approximation
of a curve by a polynomial of increasing order.


2.8 Estimation of G: parametric G families 

In EB problems we shall mostly be concerned with cases where $H(x)$ is
not known exactly, but is estimated by an empirical d.f., $H_n$, obtained
from a sample of $n$ observations on the r.v. whose d.f. is $H$. Thus we
have instead of the exact equation

$$H(x) = \int F(x \mid \lambda)\,dG(\lambda), \tag{2.8.1}$$

the approximate relationship

$$H_n(x) \approx \int F(x \mid \lambda)\,dG(\lambda). \tag{2.8.2}$$

Following the idea of minimizing $I(\cdot, \cdot)$, that is, maximizing $L(\cdot, \cdot)$,
presented in section 2.7, we now obtain an estimate $\hat{G}$ of $G$ by
maximizing $L(H_n, F_G)$. The search for $\hat{G}$ would be an impractical task
unless some restriction were imposed on the function $G$. In particular,
it must be a d.f. As we have noted in section 2.7, the problem is largely
overcome if it can be assumed that $G$ belongs to a given parametric
family $G(\lambda; \alpha, \beta, \ldots)$ of distributions, where $\alpha, \beta, \ldots$ are unknown
parameters to be estimated. Maximizing $L(H_n, F_G)$ then means
maximizing

$$\sum_{i=1}^{n} \log f_G(x_i), \tag{2.8.3}$$

where

$$f_G(x) = f(x; \alpha, \beta, \ldots) = \int f(x \mid \lambda)\,dG(\lambda; \alpha, \beta, \ldots).$$

In other words, since $F_G(x)$ now belongs to a certain parametric
family, we can use the method of maximum likelihood to estimate
$\alpha, \beta, \ldots$. In these circumstances, other 'conventional' methods of
estimating $\alpha, \beta, \ldots$ are, of course, feasible, and may be preferred. The
preference may be merely for computational reasons, if many
estimates are to be calculated as a matter of routine.


2.8.1 The method of moments 

One may use the observations $x_1, \ldots, x_n$ to estimate the parameters
$\xi = (\alpha, \beta, \ldots)$ of the marginal p.d.f. $f_G(x)$ by the usual method of
moments. We note that if

$$\mu'_r(\lambda) = \int x^r\,dF(x \mid \lambda), \qquad r = 1, 2, \ldots,$$

is the $r$th-order moment of $F(x \mid \lambda)$, and if

$$\mu'_r(\xi) = \int x^r\,dF_G(x) = \int x^r\,h(x \mid \xi)\,dx,$$

then

$$\mu'_r(\xi) = E_G\{\mu'_r(\Lambda)\}, \qquad r = 1, 2, \ldots. \tag{2.8.4}$$

The left-hand side of (2.8.4) can be estimated directly by the sample
moment

$$m'_r = \sum_{i=1}^{n} x_i^r/n.$$

Thus if the functional dependence of $\mu'_r(\xi)$ on $\xi$ can be determined
explicitly, a sufficient number of estimating equations can be written
down for determining the elements of $\xi$.


2.8.2 The method of maximum likelihood 


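For many parametric d.f.s $G(\lambda \mid \xi)$ satisfying some regularity conditions,
it is possible to develop a general iterative technique to
estimate the unknown parameter $\xi$ by the method of maximum
likelihood. Consider the likelihood function (2.8.3). The likelihood
equations for $\xi$ can be written as

$$
0 = \frac{\partial \ln L}{\partial \xi_j}
= \sum_{i=1}^{n} \frac{1}{f_G(x_i \mid \xi)} \int f(x_i \mid \lambda)\,\frac{\partial \ln g(\lambda \mid \xi)}{\partial \xi_j}\,g(\lambda \mid \xi)\,d\lambda \tag{2.8.5}
$$

provided the interchange of differentiation and integration can be
effected in the expression given by

$$\frac{\partial}{\partial \xi_j} \int f(x \mid \lambda)\,g(\lambda \mid \xi)\,d\lambda,$$

where $g(\lambda \mid \xi)$ is the p.d.f. of $G(\lambda \mid \xi)$. The equations (2.8.5) can be
rewritten as

$$\sum_{i=1}^{n} E\left[\frac{\partial \ln g(\Lambda \mid \xi)}{\partial \xi_j} \,\Big|\, X = x_i\right] = 0,$$

where the expectation is with respect to the posterior d.f. of $\Lambda$ given
$X = x_i$. We can expand the quantity

$$\frac{\partial \ln g(\lambda \mid \xi)}{\partial \xi_j}$$

at $\xi = \xi^{(0)}$, a known initial value of $\xi$, to obtain

$$
\frac{\partial \ln g(\lambda \mid \xi)}{\partial \xi_j} \approx
\left[\frac{\partial \ln g(\lambda \mid \xi)}{\partial \xi_j}\right]_{\xi = \xi^{(0)}}
+ \sum_{u=1}^{q} (\xi_u - \xi_u^{(0)}) \left[\frac{\partial^2 \ln g(\lambda \mid \xi)}{\partial \xi_u\,\partial \xi_j}\right]_{\xi = \xi^{(0)}}.
$$

Thus we have a system of equations for the $\xi_u$'s in terms of the initial values
$\xi_u^{(0)}$'s as

$$
0 = \sum_{i=1}^{n} E\left[\frac{\partial \ln g(\Lambda \mid \xi)}{\partial \xi_j} \,\Big|\, X = x_i\right]_{\xi = \xi^{(0)}}
+ \sum_{i=1}^{n}\sum_{u=1}^{q} (\xi_u - \xi_u^{(0)})\,E\left[\frac{\partial^2 \ln g(\Lambda \mid \xi)}{\partial \xi_u\,\partial \xi_j} \,\Big|\, X = x_i\right]_{\xi = \xi^{(0)}}. \tag{2.8.6}
$$

Hence a system of iterative equations for updating an initial estimate
$\xi^{(0)}$ of $\xi$ is obtained by rewriting (2.8.6) in a matrix form:

$$\xi^{(i+1)} = \xi^{(i)} + \{K(\xi^{(i)}, \mathbf{x})\}^{-1}\,U(\xi^{(i)}, \mathbf{x}), \qquad i = 0, 1, \ldots, \tag{2.8.7}$$

where $U(\xi, \mathbf{x})$ is a $q \times 1$ vector whose $j$th element is

$$\sum_{i=1}^{n} E\left[\frac{\partial \ln g(\Lambda \mid \xi)}{\partial \xi_j} \,\Big|\, X = x_i\right]$$

and $K(\xi, \mathbf{x})$ is a $q \times q$ matrix whose $(j, t)$th element is

$$-\sum_{i=1}^{n} E\left[\frac{\partial^2 \ln g(\Lambda \mid \xi)}{\partial \xi_j\,\partial \xi_t} \,\Big|\, X = x_i\right].$$

The equation (2.8.7) can be used to obtain an iterative process by
setting $i = 1, 2, \ldots$. The above iterative procedure is a result of an
application of the EM algorithm (cf. Dempster, Laird and Rubin,
1977). The details are given in a more general setting in section 2.12.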


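Example 2.8.1 Let

$$p(x \mid \lambda) = e^{-\lambda}\lambda^x/x!, \qquad x = 0, 1, 2, \ldots,$$

$$dG(\lambda; \alpha, \beta) = \frac{\alpha^{\beta}}{\Gamma(\beta)}\,\lambda^{\beta - 1} e^{-\alpha\lambda}\,d\lambda, \qquad \alpha, \beta > 0,$$

then

$$p_G(x) = \left(\frac{\alpha}{\alpha + 1}\right)^{\beta} \frac{\Gamma(\beta + x)}{\Gamma(\beta)\,x!} \left(\frac{1}{\alpha + 1}\right)^{x}, \qquad x = 0, 1, 2, \ldots,$$

the negative binomial distribution. Its mean and variance are
$\beta/\alpha$ and $\beta/\alpha + \beta/\alpha^2$
respectively; the distribution is often reparametrized by putting
$\beta/\alpha = m$, $\beta = p$, so that

$$\text{mean} = m, \qquad \text{variance} = m + m^2/p. \tag{2.8.8}$$

The moment relations (2.8.8) can be used to estimate $m$ and $p$. If $\bar{x}$ and
$s^2$ are the sample mean and variance of $x_1, \ldots, x_n$, we have estimates $\tilde{m}$
and $\tilde{p}$ of $m$ and $p$ as

$$
\tilde{m} = \bar{x}, \qquad
\tilde{p} = \begin{cases} \bar{x}^2/(s^2 - \bar{x}), & s^2 > \bar{x}, \\ +\infty, & \text{otherwise.} \end{cases} \tag{2.8.9}
$$

The maximum likelihood estimates of $m$ and $p$ are $\hat{m}$ and $\hat{p}$,
where

$$\hat{m} = \bar{x},$$

and $\hat{p}$ is that positive value of $p$ which maximizes

$$\prod_{i=1}^{n} \left\{ \left(\frac{p}{p + \bar{x}}\right)^{p} \frac{\Gamma(p + x_i)}{\Gamma(p)\,x_i!} \left(\frac{\bar{x}}{p + \bar{x}}\right)^{x_i} \right\};$$

$\hat{p} = +\infty$ is a permissible solution.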
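Both estimates of Example 2.8.1 can be sketched in a few lines. The code below is a minimal illustration, not the authors' procedure: the function names and the data are hypothetical, the sample variance is taken with divisor $n - 1$, a positive sample mean is assumed, and the likelihood is maximized by a golden-section search on $\log p$ over an arbitrarily chosen bracket, which presumes a single interior maximum.

```python
import math

def moment_estimates(xs):
    """Moment estimates (2.8.9): m = xbar, p = xbar^2 / (s^2 - xbar) or +inf."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)   # divisor n - 1 assumed
    p = xbar ** 2 / (s2 - xbar) if s2 > xbar else math.inf
    return xbar, p

def log_lik(p, xs):
    """Negative binomial log-likelihood in p with m fixed at xbar (xbar > 0)."""
    xbar = sum(xs) / len(xs)
    return sum(
        p * math.log(p / (p + xbar))
        + math.lgamma(p + x) - math.lgamma(p) - math.lgamma(x + 1)
        + x * math.log(xbar / (p + xbar))
        for x in xs
    )

def ml_p(xs, lo=1e-3, hi=1e4, iters=100):
    """Golden-section search for the maximizing p on a log scale (bracket assumed)."""
    a, b = math.log(lo), math.log(hi)
    g = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - g * (b - a), a + g * (b - a)
    for _ in range(iters):
        if log_lik(math.exp(c), xs) > log_lik(math.exp(d), xs):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return math.exp(0.5 * (a + b))

xs = [0, 1, 1, 2, 3, 5, 8, 0, 2, 4]                   # illustrative counts
m_tilde, p_tilde = moment_estimates(xs)
p_hat = ml_p(xs)
print(m_tilde, p_tilde, p_hat)
```

When the data are not over-dispersed the search simply runs to the upper end of the bracket, which corresponds to the permissible solution $\hat{p} = +\infty$.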


Example 2.8.2 Let

$$f(x \mid \lambda) = N(\lambda, 1), \qquad dG(\lambda) = N(\mu, \sigma^2),$$

then $f_G(x) = N(\mu, 1 + \sigma^2)$, and the 'natural' moment estimates of $\mu$
and $\sigma^2$ are provided by

$$\tilde{\mu} = \bar{x}, \qquad 1 + \tilde{\sigma}^2 = s^2,$$

where $\bar{x}$ and $s^2$ are the sample mean and variance of the past
observations. Since $\sigma^2 \ge 0$, our estimate of $\sigma^2$ is $\max(0, s^2 - 1)$. The
maximum likelihood estimates of $\mu$ and $\sigma^2$ are

$\hat{\mu} = \bar{x}$
$\hat{\sigma}^2 = \max$


2.9 Estimation of G: finite approximation of G

Realistically we may expect to have reliable knowledge of the
functional form of $G$ only in exceptional cases. More commonly, our
knowledge of $G$ will be rather vague, consisting perhaps of the
information that $\lambda$ is limited to a certain finite interval, for example, as
in the case of the binomial $p(x \mid \lambda)$. The discussion of section 2.7.2
suggests a possible solution to the problem, namely, to replace the $G$
of the unknown functional form by a finite $G_k$. By making $k$ large
enough, the approximation can be made arbitrarily good, according
to some reasonable criterion. In fact, as we shall see in Chapter 3, for
certain purposes $k$ can be quite small.

Implementation of this suggestion requires that we treat the
observations $x_1, x_2, \ldots, x_n$ as if they are the results of random
sampling from a population with d.f.

$$F_{G_k}(x) = \sum_{j=1}^{k} \theta_j F(x \mid \lambda_j) \tag{2.9.1}$$

(Maritz, 1967). The mixture likelihood can be approximated by

$$L_{G_k} = \sum_{i=1}^{n} \ln\{f_{G_k}(x_i)\}. \tag{2.9.2}$$

Estimation of the $\lambda$'s (or $\theta$'s) in this approximate model by the
method of maximum likelihood corresponds, again, to an application
of the method of section 2.7.2, with the l.h.s. of (2.7.1) replaced by the
empirical d.f. of the observations $x_1, \ldots, x_n$.


2.9.1 $\lambda_1, \ldots, \lambda_k$ given; $\theta_1, \ldots, \theta_k$ unknown

The EM algorithm described by Dempster, Laird and Rubin (1977) is
particularly useful for finding the maximum likelihood estimates of
the $\theta$'s in this case. Starting with initial trial values $\theta_r^{(0)}$, $r = 1, 2, \ldots, k$,
new values $\theta_r^{(1)}$ are obtained as follows:

$$
z_{ir}^{(0)} = \theta_r^{(0)} f(x_i \mid \lambda_r) \Big/ \sum_{j=1}^{k} \theta_j^{(0)} f(x_i \mid \lambda_j), \qquad
\theta_r^{(1)} = \sum_{i=1}^{n} z_{ir}^{(0)}/n. \tag{2.9.3}
$$

Similar iterative sequences are obtained by the methods of Day (1969)
or Behboodian (1975). For starting values one can take $\theta_r^{(0)} = 1/k$
($r = 1, \ldots, k$).
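The iteration (2.9.3) can be sketched as follows. This is an illustration under stated assumptions: the kernel is taken to be the $N(\lambda, 1)$ density, the function names and the data are hypothetical, and a fixed number of iterations is used in place of a convergence test.

```python
import numpy as np

def normal_kernel(x, lam):
    """Assumed kernel for the illustration: the N(lam, 1) density f(x | lam)."""
    return np.exp(-0.5 * (x - lam) ** 2) / np.sqrt(2.0 * np.pi)

def mixture_log_lik(x, lams, theta):
    """L_{G_k} of (2.9.2): sum_i ln f_{G_k}(x_i)."""
    dens = normal_kernel(np.asarray(x)[:, None], np.asarray(lams)[None, :])
    return float(np.sum(np.log(dens @ theta)))

def em_weights(x, lams, n_iter=200):
    """EM iteration (2.9.3) for the weights theta with the lam_j held fixed."""
    x = np.asarray(x, dtype=float)
    lams = np.asarray(lams, dtype=float)
    k = len(lams)
    theta = np.full(k, 1.0 / k)                        # starting values 1/k
    dens = normal_kernel(x[:, None], lams[None, :])    # n x k matrix f(x_i | lam_r)
    for _ in range(n_iter):
        z = theta * dens
        z /= z.sum(axis=1, keepdims=True)              # z_ir of (2.9.3)
        theta = z.mean(axis=0)                         # theta_r = sum_i z_ir / n
    return theta

x = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8, 2.1, 1.9]        # illustrative data
lams = [-2.0, 2.0]
theta = em_weights(x, lams)
print(theta, mixture_log_lik(x, lams, theta))
```

Since each step cannot decrease the likelihood, comparing `mixture_log_lik` before and after an iteration gives a simple sanity check on an implementation.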

Lindsay (1983a) linked this procedure with the general approach of 
the vertex direction method explained in more detail in section 2.10. 
In particular it was pointed out that the EM algorithm terminates at 
restricted maxima and hence a check for global maximality should be 
carried out. In particular a simple first-order check was given as a 
verification of the condition 

D"(A; G k ) 0 at A = A 1 ,...,A* (2.9.4) 

where 

D"(A; G k ) = t n^f(x*\l)/f Gk (x?)j. (2.9.5) 

The quantities $n_j$ and $x_j^*$ are as defined for $D(\lambda, G)$ mentioned in
relation to the vertex direction method.


2.9.2 00 k given; X lt ...,2. k unknown 

For convenience we shall take $\theta_r = 1/k$ for $r = 1, 2, \ldots, k$, and for the
sake of identifiability we impose the restriction $\lambda_1 < \lambda_2 < \cdots < \lambda_k$. We
now have to maximize $L_{G_k}$ in (2.9.2) with respect to $\lambda_1, \ldots, \lambda_k$. As in
section 2.9.1, the EM algorithm is useful here and particularly easy to
apply for certain forms of 'regular' $f(x|\lambda)$. Starting with trial values
$\lambda_r^{(0)}$ we first compute

$$w_{ir}^{(0)} = f(x_i|\lambda_r^{(0)}) \Big/ \sum_{j=1}^{k} f(x_i|\lambda_j^{(0)}),$$

for $r = 1, 2, \ldots, k$ and $i = 1, 2, \ldots, n$. Then new values $\lambda_r^{(1)}$ are obtained
as the solutions of the equations, in $\lambda_r$,

$$\sum_{i=1}^{n} w_{ir}^{(0)}\, \partial \ln f(x_i|\lambda_r)/\partial\lambda_r = 0.$$


Example 2.9.1 If $f(x|\lambda)$ is the $N(\lambda, 1)$ density straightforward
substitution gives

$$\lambda_r^{(1)} = \sum_{i=1}^{n} w_{ir}^{(0)} x_i \Big/ \sum_{i=1}^{n} w_{ir}^{(0)}.$$
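For the $N(\lambda, 1)$ case of Example 2.9.1 the update is a weighted mean, and a minimal Python sketch of the iteration is given here. The function name `em_locations`, the starting values, the number of iterations and the data are hypothetical, chosen only to illustrate the calculation with equal masses $1/k$.

```python
import math

def em_locations(xs, lams, n_iter=100):
    """EM for the points lam_r of a k-point prior with equal masses 1/k,
    assuming an N(lam, 1) data density (weighted-mean update)."""
    lams = list(lams)
    k = len(lams)
    for _ in range(n_iter):
        num = [0.0] * k
        den = [0.0] * k
        for x in xs:
            # The normalizing constant of the normal density cancels in the weights.
            dens = [math.exp(-0.5 * (x - lam) ** 2) for lam in lams]
            total = sum(dens)
            for r in range(k):
                w = dens[r] / total
                num[r] += w * x
                den[r] += w
        lams = [num[r] / den[r] for r in range(k)]
    return lams

# Hypothetical observations forming two well separated groups.
xs = [-2.1, -1.8, -2.4, -1.9, 2.2, 1.7, 2.0, -2.0, 1.9, -2.2]
print([round(lam, 3) for lam in em_locations(xs, [-1.0, 1.0])])
```

Starting from ordered trial values keeps the ordering $\lambda_1 < \lambda_2$ in this example, although the sketch does not enforce the restriction explicitly.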


Applications of the '$G_k$-ML' process to specific EB problems are
given in later chapters.

At this juncture we observe that, having decided to treat the
observations as being generated by the d.f. $F(x; \lambda_1, \lambda_2, \ldots, \lambda_k)$, we may
contemplate using other 'conventional' statistical methods for
estimating the $\lambda$'s. In certain instances the method of moments has
appeal.

Example 2.9.2 Let $f(x|\lambda)$ be as in Example 2.9.1, and suppose that
$k = 3$ with $\theta_1 = \theta_2 = \theta_3 = 1/3$. Then the low-order moments of $X_G$ are

$$\mu'_1 = [\lambda],$$

$$\mu_2 = 1 + [\lambda^2] - [\lambda]^2,$$

$$\mu_3 = [\lambda^3] - 3[\lambda][\lambda^2] + 2[\lambda]^3,$$

where $[\lambda^r] = (\lambda_1^r + \lambda_2^r + \lambda_3^r)/3$. This suggests the following method of
determining approximate $\lambda$'s using the sample moments, denoted by
$m$'s, of $x_1, x_2, \ldots, x_n$: if $m_2 > 1$ it is always possible to find $\lambda$'s such that
$m'_1 = [\lambda] = \mu'_1$, and $m_2 = \mu_2$. Hence choose $\lambda$'s satisfying these
equations which minimize $|m_3 - \mu_3|$. When $m_2 \le 1$, put all three $\lambda$'s
equal to $m'_1$.

Note that employment of the method of moments is tantamount to 
using a certain measure of distance between distributions. Still other 
measures of distance have been considered in connection with the 
same problem. A measure based on the $\chi^2$ test of goodness of fit
appears in Maritz (1967); Choi (1966) uses the Wolfowitz distance;
Bartlett and Macdonald (1968) use the method of least squares; while
Deely and Kruse (1968) propose

$$\text{'distance'} = \| H(x) - F_{G_k}(x) \| = \sup_x | H(x) - F_{G_k}(x) |.$$

They remark that once the $\lambda$'s are chosen, the problem of finding $\theta$'s to
minimize the distance can be reduced to a linear programming 
problem. 


2.10 Estimation of G: continuous mixtures 

We now consider the estimation of a general mixing distribution
without assuming any parametric form for it or approximating it by a
finite distribution. The method of estimation employed is maximization
of the criterion $L(h, w)$ of Section 2.7.1 as applied to obtaining
the distance between $F_G$ and the observed empirical d.f. $H_n$. We have

$$L(H_n, F_G) = \int \ln\{f_G(x)\}\, dH_n(x) = n^{-1} \sum_{i=1}^{n} \ln\{f_G(x_i)\}.$$

Maximization of the criterion $L(H_n, F_G)$ is of course effected by
maximization of the mixture likelihood

$$L_G = \sum_{i=1}^{n} \ln\{f_G(x_i)\}. \tag{2.10.1}$$

In the above $f_G(x) = \int f(x|\lambda)\, dG(\lambda)$ when $X$ is continuous. When $X$ is a
discrete r.v. $f(x|\lambda)$ is replaced by $p(x|\lambda)$.

The nature of the mixture likelihood $L_G$ has been investigated in
detail by Lindsay (1983a) for the general case when the form of $f(x|\lambda)$
is not specified and by Lindsay (1983b) for the case when $f(x|\lambda)$
belongs to the exponential family. The main results of the papers are
concerned with the existence of the maximum likelihood estimator
under certain conditions which specify the nature of the curve

$$\Gamma = \{\mathbf{f}_\lambda : \lambda \in \Omega\} \tag{2.10.2}$$

where

$$\mathbf{f}_\lambda = (f(x_1^*|\lambda), \ldots, f(x_s^*|\lambda)) \tag{2.10.3}$$

is the vector of likelihoods of $s\ (\le n)$ distinct values $x_1^*, \ldots, x_s^*$ of the
observations $x_1, \ldots, x_n$ and $\Omega$ is the parameter space of $\lambda$. As to the
determination of an ML estimate $\hat G$ when it exists, Lindsay (1983a)
provided a vertex direction method (VDM) for constructing such a $\hat G$
iteratively.


2.10.1 Vertex direction method 

Let $n_j$ be the number of times $x_j^*$ in (2.10.3) appears in the sample. In
addition to (2.10.3) define the quantities:

$$\mathbf{f}_G = (f_G(x_1^*), \ldots, f_G(x_s^*))^T,$$

$$D(\lambda; G) = \sum_{j=1}^{s} n_j\{f(x_j^*|\lambda)/f_G(x_j^*) - 1\}$$

and

$$J(\mathbf{u}) = \sum_{i=1}^{s} n_i \ln u_i$$

for any vector $\mathbf{u} = (u_1, \ldots, u_s)^T$.

Let $G^{(i)}$ be an estimator of $G$ at the $i$th step.

Step (i): Find $\lambda^{(i)}$ to maximize $D(\lambda; G)$ at $G = G^{(i)}$.

Step (ii): Find $\epsilon^{(i)}$ to maximize

$$J(\mathbf{f}_G(1 - \epsilon) + \epsilon\, \mathbf{f}_\lambda)$$

at $G = G^{(i)}$ and $\lambda = \lambda^{(i)}$.

Step (iii): Define

$$\delta(z|\lambda^{(i)}) = \begin{cases} 1 & \text{if } z \ge \lambda^{(i)} \\ 0 & \text{otherwise.} \end{cases}$$

Set the $(i+1)$th step estimator of $G$ to be

$$G^{(i+1)}(z) = (1 - \epsilon^{(i)})G^{(i)}(z) + \epsilon^{(i)}\delta(z|\lambda^{(i)}).$$

Step (iv): Repeat the iteration until $G^{(i)}$ converges to $\hat G$.
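Steps (i) to (iv) can be sketched in Python as follows. This is a rough illustration and not Lindsay's implementation: it assumes Poisson data, restricts the search in Step (i) to a finite grid of candidate $\lambda$ values, and performs Step (ii) by a ternary search on $[0, 1]$, which is adequate here because the criterion is concave in $\epsilon$. The names `vdm`, `mixture_loglik`, the grid, the tolerance and the data are hypothetical.

```python
import math
from collections import Counter

def poisson_pmf(x, lam):
    """Poisson probability p(x | lam), computed on the log scale."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

def mixture_loglik(xs, G):
    """Mixture log-likelihood L_G for a discrete G stored as {lam: mass}."""
    return sum(math.log(sum(m * poisson_pmf(x, lam) for lam, m in G.items()))
               for x in xs)

def vdm(xs, grid, tol=1e-3, max_iter=500):
    counts = Counter(xs)
    xstar = sorted(counts)                      # distinct values x*_1, ..., x*_s
    nj = [counts[x] for x in xstar]             # multiplicities n_j
    f_lam = {lam: [poisson_pmf(x, lam) for x in xstar] for lam in grid}

    def J(u):
        return sum(n * math.log(v) for n, v in zip(nj, u))

    mean = sum(xs) / len(xs)
    start = min(grid, key=lambda lam: abs(lam - mean))
    G = {start: 1.0}                            # one-point starting estimate
    for _ in range(max_iter):
        fG = [sum(m * f_lam[lam][j] for lam, m in G.items())
              for j in range(len(xstar))]

        def D(lam):
            return sum(n * (f / g - 1.0) for n, f, g in zip(nj, f_lam[lam], fG))

        lam_new = max(grid, key=D)              # Step (i), on the grid only
        if D(lam_new) < tol:                    # no direction improves L_G
            break
        fl = f_lam[lam_new]

        def J_eps(eps):
            return J([(1.0 - eps) * g + eps * f for g, f in zip(fG, fl)])

        lo, hi = 0.0, 1.0                       # Step (ii): ternary search
        for _ in range(60):
            a = lo + (hi - lo) / 3.0
            b = hi - (hi - lo) / 3.0
            if J_eps(a) < J_eps(b):
                lo = a
            else:
                hi = b
        eps = 0.5 * (lo + hi)
        G = {lam: (1.0 - eps) * m for lam, m in G.items()}   # Step (iii)
        G[lam_new] = G.get(lam_new, 0.0) + eps
    return G

# Hypothetical Poisson-type counts with two apparent groups.
xs = [0, 1, 1, 2, 0, 1, 7, 8, 9, 6, 1, 0, 2, 8, 7]
grid = [0.25 * i for i in range(1, 49)]         # candidate values 0.25, ..., 12
G_hat = vdm(xs, grid)
for lam in sorted(G_hat):
    if G_hat[lam] > 1e-3:
        print(lam, round(G_hat[lam], 3))
```

The stopping rule uses the size of the largest $D(\lambda; G^{(i)})$ on the grid; convergence of the VDM can be slow, so the iteration cap is a practical safeguard rather than part of the method.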

Practical implementation of the VDM algorithm requires numer¬ 
ical work. A closely related algorithm has been given earlier for the 
special case of Poisson data d.f. by Simar (1976). An algorithm for 
nonparametric estimation of G has been developed for a more general 
case by Der Simonian (1986). 

2.11 Estimation of G : miscellaneous methods 

Rutherford and Krutchkoff (1967) take a position intermediate
between those of sections 2.8 and 2.10. They assume that $G$ is a
member of the Pearson family of distributions. The d.f. $F(x|\lambda)$ is taken
to be such that known functions $h_k(x)$, $k = 1, 2, 3, 4$ exist giving

$$\int h_k(x)\, dF(x|\lambda) = \lambda^k.$$

Then it follows that

$$\int h_k(x)\, dF_G(x) = \int \lambda^k\, dG(\lambda),$$

and $\int \lambda^k dG(\lambda)$ can be estimated by

$$M_{k,n} = (1/n) \sum_{i=1}^{n} h_k(x_i).$$

If $\int \lambda^k dG(\lambda) < \infty$, $M_{k,n} \to \int \lambda^k dG(\lambda)$, in probability, as $n$ increases.
Hence, for $n$ large enough it will be possible to identify the particular
Pearson curve which obtains, and to estimate its parameters using the
$M_{k,n}$. The authors give an example of the application of this
procedure.
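As a concrete instance of the functions $h_k$, consider the Poisson distribution, for which the falling factorial $h_k(x) = x(x-1)\cdots(x-k+1)$ has expectation $\lambda^k$. The choice of the Poisson case is an assumption made here for illustration; the function names and the data are likewise hypothetical. A minimal Python sketch of the estimates $M_{k,n}$ is:

```python
def falling_factorial(x, k):
    """h_k(x) = x (x - 1) ... (x - k + 1); E{h_k(X) | lam} = lam^k for Poisson X."""
    out = 1
    for j in range(k):
        out *= (x - j)
    return out

def moment_estimates(xs, kmax=4):
    """M_{k,n} = (1/n) * sum of h_k(x_i), for k = 1, ..., kmax."""
    n = len(xs)
    return [sum(falling_factorial(x, k) for x in xs) / n
            for k in range(1, kmax + 1)]

# Hypothetical past observations.
print(moment_estimates([0, 1, 2, 3, 4]))
```

The four estimated moments of $G$ would then be used to select and fit a member of the Pearson family, a step not attempted in this sketch.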

Rolph (1968) examines the special case where $\lambda$ lies in the interval
$[0, 1]$, and $X$ is a discrete integer-valued r.v. such that

$$p(x|\lambda) = \sum_{j} a_{xj}\lambda^j,$$

a polynomial in $\lambda$. The mixed p.d., $p_G(x)$, is then characterized directly
by the moments of $G$. Rolph's procedure consists in proposing a prior
distribution for the moments of $G$, and developing Bayes estimates of
the moments.

Lord (1969) studies the case where $X$ is discrete, assuming the
values $0, 1, 2, \ldots, n$ and $\lambda$ lies in the interval $[a, b]$, with the p.d.f. $g(\lambda)$.
Let $t_r(\lambda)$ be any function such that the integrals

$$m_{rx} = \int t_r(\lambda) p(x|\lambda)\, d\lambda, \quad r, x = 0, 1, 2, \ldots, n,$$

and the inverse of the matrix $\|m_{rx}\|$ exist. Putting $\|m_{rx}\|^{-1} = \|m^{xr}\|$,
and

$$w_r = \sum_{x=0}^{n} m^{xr} p_G(x),$$

it can be easily verified that

$$g(\lambda) = \sum_{r=0}^{n} w_r t_r(\lambda)$$

is a solution of (2.1.1). Lord demonstrates the existence of functions
$t_r(\lambda)$, defines smoothness criteria for $g(\lambda)$, and develops a method for
determining $w_r$ using the calculus of variations. He extends his
method to the case where $p_G(x)$ is estimated by relative frequencies,
introducing as a measure of distance between the 'observed'
distributions a criterion similar to the $\chi^2$ criterion for goodness of fit.

2.12 Estimation with unequal component sample sizes 

In practical EB problems there may be unequal numbers of
observations at different components. Also, the parameter $\lambda$ of the
data d.f. $F(x|\lambda)$ may be a $p$-vector, a realization of a vector r.v. $\Lambda$ with
d.f. $G(\lambda|\xi)$. Then corresponding to realization $\lambda_i$ $(i = 1, \ldots, n)$ of $\Lambda$,
there will in general be $m_i\ (\ge 1)$ observations $x_{i1}, \ldots, x_{im_i}$ denoted by
$\mathbf{x}_i$ on $X$. Techniques of estimating $G$ need to be extended to cover
these realistic situations. The method of ML estimation discussed in
sections 2.8, 2.9 and 2.10 is readily extended to deal with EB schemes
of unequal component sample sizes.


2.12.1 Parametric G families 

Suppose that $G$ belongs to a parametric family $G(\lambda|\xi)$ indexed by an
unknown parameter vector $\xi = (\xi_1, \ldots, \xi_q)$. The likelihood function of
$\xi$ can be obtained by using the marginal p.d.f. of $\mathbf{x}_i$ $(i = 1, \ldots, n)$ given
by

$$h(\mathbf{x}_i; \xi, m_i) = \int s(\mathbf{x}_i|\lambda, m_i)\, dG(\lambda|\xi) \tag{2.12.1}$$

where

$$s(\mathbf{x}_i|\lambda, m_i) = \prod_{j=1}^{m_i} f(x_{ij}|\lambda). \tag{2.12.2}$$

Thus the log-likelihood function of $\xi$ based on the previous data
$\mathbf{x}_1, \ldots, \mathbf{x}_n$ is

$$\ln L = \sum_{i=1}^{n} \ln h(\mathbf{x}_i; \xi, m_i). \tag{2.12.3}$$

We can maximize (2.12.3) with respect to $\xi$ by using a direct
optimization algorithm. Alternatively, we can apply an EM algorithm
(see Dempster, Laird and Rubin, 1977), to obtain an ML
estimate $\hat\xi_m$ of $\xi$ based on (2.12.3).

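The joint p.d.f. of $(\lambda_i, \mathbf{x}_i)$ $(i = 1, \ldots, n)$ from an EB scheme can be
written as

$$g(\lambda_i|\xi)\, s(\mathbf{x}_i|\lambda_i, m_i). \tag{2.12.4}$$

In the terminology of the EM approach, the 'complete data' is
$\{(\lambda_1, \mathbf{x}_1), \ldots, (\lambda_n, \mathbf{x}_n)\}$ of which the 'unobservables' are $\lambda_1, \ldots, \lambda_n$. The
log-likelihood function of the complete data is

$$\ln L(\lambda_1, \ldots, \lambda_n; \mathbf{x}_1, \ldots, \mathbf{x}_n; \xi) = \sum_{i=1}^{n} \ln g(\lambda_i|\xi) + \sum_{i=1}^{n} \ln s(\mathbf{x}_i|\lambda_i, m_i). \tag{2.12.5}$$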
Now the posterior d.f. of $\Lambda$ given $(\mathbf{x}_i, m_i)$ is

$$dB(\lambda|\mathbf{x}_i, G, m_i) = \{h(\mathbf{x}_i, G, m_i)\}^{-1}\{s(\mathbf{x}_i|\lambda, m_i)\}\, dG(\lambda|\xi). \tag{2.12.6}$$

The E-step of the EM algorithm is carried out by taking the
conditional expectation of the quantity (2.12.5) given the data
$\mathbf{x}_1, \ldots, \mathbf{x}_n$ and the prior d.f. $G(\lambda|\xi^+)$ where $\xi^+$ is a given initial
estimate of $\xi$. We have

$$E\{\ln L(\lambda_1, \ldots, \lambda_n; \mathbf{x}_1, \ldots, \mathbf{x}_n)\,|\,\mathbf{x}_1, \ldots, \mathbf{x}_n, \xi^+\}$$

$$= \sum_{i=1}^{n} E\{\ln g(\Lambda|\xi)\,|\,\mathbf{x}_i, \xi^+\} + \sum_{i=1}^{n} E\{\ln s(\mathbf{x}_i|\Lambda, m_i)\,|\,\mathbf{x}_i, \xi^+\}$$

$$= \sum_{i=1}^{n} E\{\ln g(\Lambda|\xi)\,|\,\mathbf{x}_i, \xi^+\} + Z(\mathbf{x}_1, \ldots, \mathbf{x}_n, \xi^+), \tag{2.12.7}$$

where the term $Z(\mathbf{x}_1, \ldots, \mathbf{x}_n, \xi^+)$ is a function not depending on $\xi$ and
the expectations are taken with respect to the d.f. of $\Lambda$ given $(\mathbf{x}_i, m_i)$ in
(2.12.6).

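The M-step of the EM algorithm is accomplished by maximizing
the quantity (2.12.7) with respect to $\xi$ for fixed $\mathbf{x}_i$'s and $\xi^+$. This leads
to a set of equations

$$\sum_{i=1}^{n} \frac{\partial}{\partial\xi_j} E\{\ln g(\Lambda|\xi)\,|\,\mathbf{x}_i, \xi^+\} = 0, \quad j = 1, \ldots, q.$$

Under the regularity conditions which allow for the interchange of
differentiation and integration in the above set of equations, we
obtain

$$\sum_{i=1}^{n} E\left\{\frac{\partial}{\partial\xi_j} \ln g(\Lambda|\xi)\,\Big|\,\mathbf{x}_i, \xi^+\right\} = 0, \quad j = 1, \ldots, q. \tag{2.12.8}$$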


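We can expand the quantity under the expectation sign in (2.12.8) in a
Taylor series of $\xi$'s about $\xi^+$. We then have, to first order,

$$\frac{\partial}{\partial\xi_j} \ln g(\Lambda|\xi) \approx \left[\frac{\partial}{\partial\xi_j} \ln g(\Lambda|\xi)\right]_{\xi=\xi^+} + \sum_{t=1}^{q} (\xi_t - \xi_t^+)\left[\frac{\partial^2}{\partial\xi_j\,\partial\xi_t} \ln g(\Lambda|\xi)\right]_{\xi=\xi^+}.$$

Using the Newton-Raphson approach, we obtain an iterative
sequence for updating $\xi^+$ as

$$\xi^{(i+1)} = \xi^{(i)} - J^{-1}U, \quad i = 1, 2, \ldots \tag{2.12.9}$$

where $U$ is a $q \times 1$ vector whose $t$th element is

$$U_t(\mathbf{x}_1, \ldots, \mathbf{x}_n; \xi^{(i)}) = \sum_{l=1}^{n} E\left\{\left[\frac{\partial \ln g(\Lambda|\xi)}{\partial\xi_t}\right]_{\xi=\xi^{(i)}} \Big|\, \mathbf{x}_l, \xi^{(i)}\right\}$$

and $J$ is a $q \times q$ matrix whose $(j, t)$th element is

$$J_{jt}(\mathbf{x}_1, \ldots, \mathbf{x}_n; \xi^{(i)}) = \sum_{l=1}^{n} E\left\{\left[\frac{\partial^2 \ln g(\Lambda|\xi)}{\partial\xi_j\,\partial\xi_t}\right]_{\xi=\xi^{(i)}} \Big|\, \mathbf{x}_l, \xi^{(i)}\right\}.$$

In the iteration process (2.12.9), $\xi^{(1)}$ is taken as the initial estimate $\xi^+$.
When the iteration converges, the resulting quantity $\hat\xi_m$ is taken as the
ML estimator of $\xi$.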

In the following chapters, the bias vector and covariance matrix of
$\hat\xi_m$ will be denoted respectively by

$$b(\xi) = E(\hat\xi_m) - \xi \tag{2.12.10}$$

and

$$V(\xi) = E\{\hat\xi_m - E(\hat\xi_m)\}\{\hat\xi_m - E(\hat\xi_m)\}^T. \tag{2.12.11}$$

The mean square error matrix is then given by

$$M(\xi) = V(\xi) + b(\xi)b(\xi)^T = E(\hat\xi_m - \xi)(\hat\xi_m - \xi)^T. \tag{2.12.12}$$

Exact evaluation of the bias and mean squared error of $\hat\xi_m$ is seldom
possible. However, approximate quantities to terms of $O(n^{-1})$ can be
obtained by using standard likelihood methods (see e.g. Bowman and
Shenton, 1973) as applied to (2.12.3).


2.12.2 Finite step-function approximation to G 

Suppose next that $G$ can be approximated by a finite distribution
$G_k(\lambda)$ having probability masses $\theta_1, \ldots, \theta_k$ at the points $\alpha_1, \ldots, \alpha_k$ of
the $p$-dimensional parameter space $\Omega$ of $\lambda$. The masses satisfy the
constraints $(\theta_1 + \cdots + \theta_k = 1,\ 0 \le \theta_u \le 1)$. We consider here only the
case where $\alpha_1, \ldots, \alpha_k$ are assumed to be known and only $\theta_1, \ldots, \theta_k$ are
to be estimated. This is the case where identifiability of $G_k$ is most
easily established. Related cases such as when $\theta_1, \ldots, \theta_k$ are known
and $\alpha_1, \ldots, \alpha_k$ are unknown or all $\theta_u$'s and $\alpha_u$'s are unknown can also
be treated in principle, but it is quite difficult to establish conditions
for identifiability. The approximate likelihood function approach
demonstrated for the single-parameter case in section 2.9 is readily
extended to the more general case of $p$-dimensional $\lambda$ and also of
unequal sample sizes at various stages of the EB scheme. The analysis of
the single-parameter case can be repeated and an iterative sequence
for the $\theta_u$'s can be obtained as follows:

$$\theta_u^{(i+1)} = \sum_{t=1}^{n} w_{tu}^{(i+1)}/n$$

with

$$w_{tu}^{(i+1)} = \theta_u^{(i)} s(\mathbf{x}_t|\alpha_u, m_t) \Big/ \sum_{v=1}^{k} \theta_v^{(i)} s(\mathbf{x}_t|\alpha_v, m_t)$$

where $s(\mathbf{x}_t|\alpha_u, m_t)$ is computed from (2.12.2) with $\alpha_u$ in place of $\lambda$.



CHAPTER 3 


Empirical Bayes point estimation 


3.1 Introduction 


An outline of Bayes point estimation has been given in sections 1.3
and 1.6 and we recall that the Bayes estimate of $\lambda$ under the quadratic
loss structure is

$$\delta_G(x) = \int \lambda\, dF(x|\lambda)\, dG(\lambda) \Big/ \int dF(x|\lambda)\, dG(\lambda); \tag{3.1.1}$$

see also (1.3.3). The expected loss for any estimator $\delta(x)$, $W(\delta)$, is
related to $W(\delta_G)$ by

$$W(\delta) = W(\delta_G) + \int \{\delta(x) - \delta_G(x)\}^2\, dF_G(x) = W(\delta_G) + K(\delta, \delta_G). \tag{3.1.2}$$


Any estimate of $\delta_G(x)$ based on past data can be thought of as an
empirical Bayes estimator. Most detailed studies of empirical Bayes
estimators (EBEs) have been in the framework of the sampling
scheme described in section 1.8: independent past observations
$x_1, x_2, \ldots, x_n$, obtained with independent realizations $\lambda_1, \lambda_2, \ldots, \lambda_n$ of
$\Lambda$ comprise the past data. For the most part this scheme will be
adhered to in this chapter. The overall expected loss of an EBE $\delta_n(x)$ is
defined as $E_nW(\delta_n)$, as in section 1.10.

In this chapter details of certain methods of obtaining EB point
estimates are given, and they are applied to some of the standard
distributions. The behaviour of $W(\delta_n)$ is clearly a topic of importance
in assessing the performance of EBEs. Reasonably general statements
can be made about asymptotic optimality of $\delta_n$, but studies of
individual cases seem to be needed when $n$ is not large. Some such case
studies are reported, involving distributions like the Poisson, normal
and several others.
Much of the writing on pure Bayes estimation is occupied with
considerations of diffuse or non-informative prior distributions. The
philosophical issue is reconciliation of the use of Bayes's theorem for
inference, while not claiming sharp prior knowledge of parameter
values. In the EB approach, and EB point estimation in particular,
our attitude is different. The introductory discussion of the performance
of EBEs, section 1.10, suggests that an EBE would only be
considered a serious competitor for a more conventional estimator, $T$,
if $W(\delta_G)$ is considerably smaller than $W(T)$. Take the example of
$X = N(\lambda, \sigma^2)$, $\Lambda = N(\mu_G, \sigma_G^2)$, $T = X$, where $W(T) = \sigma^2$, $W(\delta_G) =
1/\{1/\sigma^2 + 1/\sigma_G^2\}$. If $\sigma_G^2 = \sigma^2$ we have $W(\delta_G) = (1/2)W(T)$ and it
would seem that some worthwhile gain may be possible with an EBE.
But if $\sigma_G^2 = 10\sigma^2$ we have $W(\delta_G) = (10/11)W(T)$, and EB is unlikely to
be much better than conventional estimation. Thus, the flavour of
discussions in this chapter is that the dispersion of the distribution of
$\Lambda$ is small in some sense that is relevant to point estimation for the
particular data distribution $F(x|\lambda)$.


3.2 Asymptotic optimality 


3.2.1 Consistent estimation of $\delta_G(x)$

Suppose that $\delta_n(x)$ is an EBE, that is, a consistent estimate of $\delta_G(x)$ in
the sense that $\delta_n(x) \to \delta_G(x)$, (P), for every $x$. Without being practically
unrealistic some of the mathematical arguments to do with asymptotic
optimality can be simplified by truncating $\delta_n(x)$ at finite lower
and upper limits $L$ and $U$ respectively. This will also ensure that
$E_n\{\delta_n(x) - \delta_G(x)\}^2 \to 0$ for $x$ such that $\delta_G(x) \in (L, U)$.

Suppose also that

$$\int \{\delta_G(x)\}^2\, dF_G(x) < \infty. \tag{3.2.1}$$

This condition is not very restrictive because we can write

$$W(\delta_G) = \int \lambda^2\, dG(\lambda) - \int \{\delta_G(x)\}^2\, dF_G(x),$$

so that finiteness of $\int \lambda^2 dG(\lambda)$ will ensure the validity of (3.2.1). In the
light of the discussion in section 3.1 restricting attention to prior
distributions with finite variances when considering EBEs given by
(3.1.1) seems entirely reasonable.
Consider the set of $x$-values such that $\delta_G(x) \in (L, U)$, and, if
necessary, a subset contained in an $x$-interval of finite length. Suppose
such an interval to be $I(L, U)$. Then, since $\delta_n(x) \to \delta_G(x)$, (P), for all
$x \in I(L, U)$ we can state that

$$\int_{x \in I(L,U)} \{\delta_n(x) - \delta_G(x)\}^2\, dF_G(x) \to 0, \ (P).$$

Also

$$\int_{x \notin I(L,U)} \{\delta_n(x) - \delta_G(x)\}^2\, dF_G(x) \le \int_{x \notin I(L,U)} \{\delta_G(x)\}^2\, dF_G(x).$$

Therefore, noting condition (3.2.1) we can choose $I(L, U)$ so that

$$W(\delta_n) \to W(\delta_G) + \epsilon, \ (P),$$

where $\epsilon$ is arbitrarily small.

Arguing similarly, using $E_n\{\delta_n(x) - \delta_G(x)\}^2 \to 0$ for $x$ such that
$\delta_G(x) \in (L, U)$, we can show that

$$E_nW(\delta_n) \to W(\delta_G) + \epsilon',$$

again with $\epsilon'$ arbitrarily small.


Example 3.2.1 If the distribution of $X$ is Poisson($\lambda$) we have

$$\delta_G(x) = (x + 1)p_G(x + 1)/p_G(x)$$

and it is readily seen that

$$\delta_n(x) = (x + 1)f_n(x + 1)/\{1 + f_n(x)\}$$

given by (1.9.1) is a consistent estimate of $\delta_G(x)$ for every $x$.
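The estimate $\delta_n(x)$ of Example 3.2.1 needs only the frequencies $f_n(x)$ and $f_n(x+1)$ of the past observations. A minimal Python sketch follows; the function name `poisson_ebe` and the past data are hypothetical.

```python
from collections import Counter

def poisson_ebe(past, x):
    """delta_n(x) = (x + 1) f_n(x + 1) / {1 + f_n(x)} for Poisson data,
    where f_n(x) is the number of past observations equal to x."""
    counts = Counter(past)
    return (x + 1) * counts[x + 1] / (1 + counts[x])

# Hypothetical past observations.
past = [0, 0, 1, 1, 1, 2, 3]
for x in range(4):
    print(x, poisson_ebe(past, x))
```

The estimate is zero whenever the value $x + 1$ has not yet been observed, which is one source of the irregular behaviour of $\delta_n(x)$ as a function of $x$.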


3.2.2 Consistent estimation of an approximation to $\delta_G(x)$

Approximations to the Bayes estimate have been discussed in sections
1.12 and 2.7. Let $\delta_G^*(x)$ be an approximation of $\delta_G(x)$ having the
property that $W(\delta_G^*) - W(\delta_G) = \Delta$, where $\Delta$ is usually small. Suppose
that $\delta_n^*$ is an empirical version of $\delta_G^*$, i.e. an estimate of $\delta_G^*$ such that
$\delta_n^*(x) \to \delta_G^*(x)$, (P) for every $x$. Then by arguing as in section 3.2.1, in
particular by truncation of $\delta_n^*$ if necessary, we can show that

$$E_nW(\delta_n^*) \to W(\delta_G) + \Delta + \epsilon$$

and refer to this property as $\Delta$-asymptotic optimality. See also
Rutherford and Krutchkoff (1969).

Example 3.2.2 Suppose that $X = \text{Poisson}(\lambda)$ and $\Lambda = U(0, A)$. Then
$\delta_G(x) = (x + 1)I_{x+1}/I_x$ where

$$I_x = 1 - e^{-A}\{1 + A + A^2/2! + \cdots + A^x/x!\}.$$

Let $\delta_G^*(x)$ be the linear Bayes estimator as given in Example 1.12.1.
Substituting $E(\Lambda) = A/2$, $\mathrm{var}(\Lambda) = A^2/12$,

$$\delta_G^*(x) = (3 + x)/(1 + 6/A).$$

Suppose that $\hat A = 2\bar x$ where $\bar x$ is the mean of the past $x$-values.
Then $\hat A \to A$, (P) and $\delta_n^*(x) = (3 + x)/(1 + 6/\hat A) \to \delta_G^*(x)$, (P), for every $x$.
Therefore $\delta_n^*$ is $\Delta$-asymptotic optimal, the actual value of $\Delta$ depending
on $A$. For example, if $A = 3$, $\delta_G^*(x) = 1 + x/3$, $W(\delta_G) = 0.4725$,
$W(\delta_G^*) = 0.5$, $\Delta = 0.0275$.
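The risks quoted for $A = 3$ in Example 3.2.2 can be checked numerically. The sketch below is an illustration only: it uses $p_G(x) = I_x/A$ (which follows from integrating the Poisson probability over $U(0, A)$), the identity $W(\delta_G) = E(\Lambda^2) - \sum \{\delta_G(x)\}^2 p_G(x)$, and relation (3.1.2) for the linear Bayes estimator. The function names and the truncation of the sums at `xmax` are assumptions of the sketch.

```python
import math

def upper_tail(x, A, terms=200):
    """I_x = 1 - exp(-A) * sum_{j=0}^{x} A^j / j!, summed as an upper Poisson tail
    to avoid cancellation for large x."""
    return sum(math.exp(-A + j * math.log(A) - math.lgamma(j + 1))
               for j in range(x + 1, x + 1 + terms))

def bayes_estimate(x, A):
    """delta_G(x) = (x + 1) I_{x+1} / I_x for Poisson data and a U(0, A) prior."""
    return (x + 1) * upper_tail(x + 1, A) / upper_tail(x, A)

def linear_bayes_estimate(x, A):
    """Linear Bayes estimate (3 + x) / (1 + 6 / A)."""
    return (3 + x) / (1 + 6 / A)

def risks(A, xmax=60):
    """Return (W(delta_G), W(delta*_G)), truncating the sums over x at xmax."""
    w_bayes = A * A / 3.0                      # E(Lambda^2) for U(0, A)
    excess = 0.0
    for x in range(xmax + 1):
        p = upper_tail(x, A) / A               # marginal probability p_G(x)
        d = bayes_estimate(x, A)
        w_bayes -= d * d * p
        excess += (linear_bayes_estimate(x, A) - d) ** 2 * p
    return w_bayes, w_bayes + excess

w_bayes, w_linear = risks(3.0)
print(round(w_bayes, 4), round(w_linear, 4), round(w_linear - w_bayes, 4))
```

The difference between the two printed risks is the quantity $\Delta$ of the example.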


3.2.3 The rate of convergence of $E_nW(\delta_n)$

If $\delta_n$ is a.o. ($E$) the rate of convergence of $E_nW(\delta_n)$ will clearly depend
on the rate of convergence of $\delta_n(x)$ to $\delta_G(x)$ or $E\{\delta_n(x) - \delta_G(x)\}^2$ to
zero. A very common situation corresponds to well established
practice in point estimation where $E\{\delta_n(x) - \delta_G(x)\}^2 = O(1/n)$. The
majority of regular estimation problems allow this sort of statement,
but we shall indicate some exceptions relevant to EB estimation.

Now, if it is true that $\delta_n(x)$ is a consistent estimator of $\delta_G(x)$ with the
property that for each $x$,

$$E\{\delta_n(x) - \delta_G(x)\}^2 = O(1/n), \tag{3.2.2}$$

then $E_nW(\delta_n) = W(\delta_G) + O(1/n)$. This can be established by arguing as
in section 3.2.1, and, if necessary, truncating $\delta_n(x)$ as before.

Example 3.2.3 Consider the Poisson case again, as in
Example 3.2.1. The joint distribution of $f_n(x)$ and $f_n(x + 1)$ is
trinomial $\{n, p_G(x), p_G(x + 1)\}$ and by using Taylor type expansions it
is readily established that (3.2.2) holds.
The rate of approach to $\Delta$-asymptotic optimality can be dealt with
in similar fashion. Suppose that $\delta_n^*(x)$ is a consistent estimator of $\delta_G^*(x)$
satisfying a relation like (3.2.2). Then $E_nW(\delta_n^*) = W(\delta_G) + \Delta + O(1/n)$.

Example 3.2.4 Refer to Example 3.2.2 and note that $\hat A$ is an
unbiased estimator of $A$ with variance $A^2/3n$, $\delta_n^*(x) = (3 + x)\hat A/(\hat A + 6)$,
$\delta_G^*(x) = (3 + x)A/(A + 6)$, and straightforward calculations show that
(3.2.2) holds. The distribution of $\hat A$ is asymptotically $N(A, A^2/3n)$ and
the distribution of $\delta_n^*(x)$ is asymptotically $N(\delta_G^*(x); 12A^2/n(A + 6)^2)$
(Serfling, 1980, p. 118).

Examples of EBEs for which (3.2.2) does not necessarily hold occur
when the distribution of $X$ is continuous and the EBE is expressed in
terms of estimates of $f_G(x)$ and $f'_G(x)$. Typically kernel estimates of
$f_G(x)$ and $f'_G(x)$ may be used. They are discussed briefly in
section 3.4.6. The rates of convergence of these estimates depend on
factors such as the choice of window width.

3.3 Robustness with respect to the prior distribution 

The question we address is: suppose the prior distribution is $G$,
belonging to a class $\mathcal{G}$, but in attempting to estimate $G$ it is taken to
belong to a class $\mathcal{G}^*$. Suppose that the estimating procedure is such
that a $\hat G^*$ is obtained as an estimate of $G^*$ which is, in some defined
sense, the closest member of $\mathcal{G}^*$ to $G$. Then $\delta_{\hat G^*}$ is an estimate of $\delta_{G^*}$,
the latter being an approximation to $\delta_G$. Can something be said about
the magnitude of $\Delta = W(\delta_{G^*}) - W(\delta_G)$?

Before attempting to answer the question we note that the notion of
$\Delta$-asymptotic optimality was used in section 3.2.2; obviously one
would say that an EB estimation procedure is relatively robust if it
guarantees $\Delta$-optimality with $\Delta$ small. The smallness of $\Delta$ is relative to
$W(\delta_G)$. The notion of $\Delta$-asymptotic optimality is used also in
connection with procedures such as linear EB estimators, based on
linear Bayes estimators. In a given problem there is not necessarily a
class $\mathcal{G}^*$ such that $\delta_{G^*}$ is linear in $x$. Thus, while the ideas of robustness
and $\Delta$-asymptotic optimality have much in common they also have to
do with slightly different facets of EB estimation.

In section 3.1 we argued in favour of relatively non-diffuse priors in
the context of EB estimation, and shall adopt that approach here. We
begin by assuming, as we shall do throughout this chapter, that $\Lambda$ has
finite variance and that we can set two limits $\lambda_L$, $\lambda_U$ such that
$P\{\lambda_L < \Lambda < \lambda_U\} = 1 - \epsilon$ with $\epsilon$ small. Crudely we can now regard
calculation of the Bayes estimate according to formula (3.1.1) as
averaging $\lambda f(x|\lambda)$ and $f(x|\lambda)$ w.r.t. $G(\lambda)$ truncated over a window
whose endpoints are $\lambda_L$ and $\lambda_U$.

Fig. 3.1 (horizontal axis: $\lambda$)

Assume that

1. $G^*$ and $G$ have identical first, second and third moments; and

2. over the interval $(\lambda_L, \lambda_U)$ it is possible to approximate $f(x|\lambda)$ by a
quadratic for every $x$. In Fig. 3.1 $f(x|\lambda)$ is depicted for two values of
$x$, $x'$ and $x''$. For each of these the approximation of $f(x|\lambda)$ by a
quadratic in $\lambda$ would be quite reasonable. Thus

$$f(x|\lambda) \approx A_{x0} + A_{x1}\lambda + A_{x2}\lambda^2 = \tilde f(x|\lambda)$$

and

$$\tilde\delta_G(x) = \frac{\int \lambda \tilde f(x|\lambda)\, dG(\lambda)}{\int \tilde f(x|\lambda)\, dG(\lambda)} \approx \delta_G(x).$$

Now, by assumption 1 of equality of the first three moments,

$$\tilde\delta_{G^*}(x) - \delta_G(x) = \tilde\delta_G(x) - \delta_G(x).$$

In these conditions $\Delta$ could be quantified in terms of the maximum
difference between $f(x|\lambda)$ and $\tilde f(x|\lambda)$, but the discussion shows
qualitatively that a high degree of robustness w.r.t. the choice of $G^*$
can be expected with some modest regularity requirements of $f(x|\lambda)$
as a function of $\lambda$.


Example 3.3.1 Let $X = N(\lambda, 1)$, $\Lambda = N(0, 1/9)$ and consider three
classes of distributions $\mathcal{G}_r^*$, $r = 1, 2, 3$ from which an approximation
for $G$ might be chosen.

$\mathcal{G}_1^*$: uniform; $\mathcal{G}_2^*$: triangular; $\mathcal{G}_3^*$: $g(\lambda) = (1/\sigma)\exp(-|\lambda - \mu|/\sigma)$.

Then, adjusting the parameters so that the first two moments are the
same as those of the true $G$,

$G_1^*$ is $U(-1/\sqrt{3}, 1/\sqrt{3})$

$G_2^*$ is triangular $(-\sqrt{2/3}, +\sqrt{2/3})$

$G_3^*$ has p.d.f. $g^*(\lambda) = (3/\sqrt{2})\exp\{-|3\sqrt{2}\lambda|\}$.

Typical results are shown in Table 3.1.





Table 3.1 


X 

<5 0 (x) = x/10 


s ai(x) 


0 

00 

0 

0 

0 

l 

0-1 

01539 

00963 

00903 

2 




0-2024 

3 

0-3 

0-3649 


0-3631 

4 

0-4 

0-4219 


0-6291 

W( ) 

01000 

0-1011 

0-1000 

0-1001 


These results, especially the values of $W(\cdot)$ for the different
approximations, support the expectations of reasonable robustness
w.r.t. $G$.

Similar robustness studies have been reported by Rubin (1977).
Some more details of approximations of Bayes estimators by
estimators of type $\delta_{G^*}$ will be given in later sections dealing with
particular data distributions. For further literature on robustness in
this context the reader is referred to Berger (1986).

3.4 Simple EB estimates 

3.4.1 Introduction 

Conceptually the most direct way of obtaining an EB estimate is by
constructing an estimate, $\hat G$, of the prior $G$ and then replacing $G$ by $\hat G$
in the formula for $\delta_G$. In special cases it turns out that $\delta_G$ can be
estimated without explicitly estimating $G$. The Poisson example
of section 1.9 is a case in point. An apparent advantage of adopting 
such a procedure, where possible, is that it is distribution free w.r.t. 
G. No assumption is needed about the class of distributions to 
which G might belong, or in which a good approximation to G might 
be found. A disadvantage of these estimators is that they are 
typically not smooth. 

We shall refer to all EB estimates which are derived without explicit 
estimation of G as simple EBEs. Simple EBEs are obtainable 
whenever the Bayes estimate can be expressed in terms of the 
probabilities or density of the marginal distribution F G , or transforms 
of these. 
3.4.2 The discrete exponential family

Let

$$p(x|\lambda) = \lambda^x B(x) \exp\{A(\lambda)\}, \quad x = 0, 1, 2, \ldots \tag{3.4.1}$$

This is a somewhat special form of the exponential family of discrete
distributions. The Poisson distribution can be written as in (3.4.1)
with $B(x) = 1/x!$, $A(\lambda) = -\lambda$, but in its natural parametrization the
binomial distribution does not take the form (3.4.1).

Substituting $p(x|\lambda)$ in (3.1.1) gives

$$\delta_G(x) = \{B(x)/B(x + 1)\}\, p_G(x + 1)/p_G(x) \tag{3.4.2}$$

as the Bayes point estimate of $\lambda$. Now, if $f_n(x)$ of the past observations
$x_1, x_2, \ldots, x_n$ have the value $x$ we can estimate $p_G(x)$ and $p_G(x + 1)$ by
$(1 + f_n(x))/(n + 1)$ and $f_n(x + 1)/(n + 1)$ giving the EBE

$$\delta_n(x) = \{B(x)/B(x + 1)\}\, f_n(x + 1)/\{1 + f_n(x)\}. \tag{3.4.3}$$

It is a simple EBE because $\delta_G(x)$ depends only on the marginal
probabilities $p_G(x)$ and $p_G(x + 1)$ which can be estimated as indicated
above.

The joint distribution of $f_n(x)$ and $f_n(x + 1)$ is trinomial, as noted in
Example 3.2.3, and $\delta_n(x)$ is clearly a consistent estimate of $\delta_G(x)$.
Therefore $\delta_n(x)$ is a.o. ($E$) in the slightly restricted sense of the
argument in section 3.2. Moreover, calculation of $E_nW(\delta_n)$ is in
principle straightforward, although it might be tedious in special
cases. Calculations can be simplified somewhat by using

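$$E_n\left\{\frac{f_n(x + 1)}{1 + f_n(x)}\right\} = \frac{q_{x+1}}{q_x}\{1 - (1 - q_x)^n\},$$

$$E_n\left\{\frac{f_n(x + 1)}{1 + f_n(x)}\right\}^2 = n(n - 1)q_{x+1}^2 \sum_{r=0}^{n-2} \frac{1}{(r + 1)^2} B(n - 2, q_x, r) + nq_{x+1} \sum_{r=0}^{n-1} \frac{1}{(r + 1)^2} B(n - 1, q_x, r),$$

where $q_x = p_G(x)$, $B(n, \theta, r) = \binom{n}{r}\theta^r(1 - \theta)^{n-r}$, $r = 0, 1, 2, \ldots, n$.
For large $n$ one can use the approximation

$$E_n\left[\frac{f_n(x + 1)}{1 + f_n(x)} - \frac{p_G(x + 1)}{p_G(x)}\right]^2 \approx \{p_G(x + 1)p_G(x) + p_G^2(x + 1)\}/\{np_G^3(x)\}. \tag{3.4.4}$$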
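The exact moment formulae and the large-$n$ approximation (3.4.4) can be compared numerically. The following Python sketch is illustrative only; the function names and the values of $n$, $q_x$ and $q_{x+1}$ are hypothetical, and the binomial probabilities are evaluated on the log scale simply to avoid overflow for large $n$.

```python
import math

def binom_pmf(n, theta, r):
    """B(n, theta, r), the binomial probability, computed on the log scale."""
    log_p = (math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)
             + r * math.log(theta) + (n - r) * math.log(1.0 - theta))
    return math.exp(log_p)

def exact_moments(n, qx, qx1):
    """First and second moments of f_n(x+1) / {1 + f_n(x)}; requires n >= 2."""
    first = (qx1 / qx) * (1.0 - (1.0 - qx) ** n)
    second = (n * (n - 1) * qx1 ** 2
              * sum(binom_pmf(n - 2, qx, r) / (r + 1) ** 2 for r in range(n - 1))
              + n * qx1
              * sum(binom_pmf(n - 1, qx, r) / (r + 1) ** 2 for r in range(n)))
    return first, second

def exact_mse(n, qx, qx1):
    """Exact mean squared error about the ratio p_G(x+1) / p_G(x)."""
    rho = qx1 / qx
    first, second = exact_moments(n, qx, qx1)
    return second - 2.0 * rho * first + rho ** 2

def approx_mse(n, qx, qx1):
    """Right-hand side of (3.4.4)."""
    return (qx1 * qx + qx1 ** 2) / (n * qx ** 3)

for n in (20, 200, 2000):
    print(n, exact_mse(n, 0.3, 0.2), approx_mse(n, 0.3, 0.2))
```

Multiplying these quantities by $\{B(x)/B(x + 1)\}^2$ gives the corresponding mean squared errors of $\delta_n(x)$ in (3.4.3).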





Some standard examples of distributions of the type (3.4.1) are the 
Poisson, geometric, negative binomial, as given in section 1.3. 
Numerical results for $E_nW(\delta_n)$ in some special cases are given in
section 3.7.

3.4.3 The continuous exponential family 

Consider the special form of the continuous exponential family of
distributions with p.d.f.

$$f(x|\lambda) = \exp\{\lambda A(x) + B(\lambda) + C(x)\}, \tag{3.4.5}$$

for which

$$\{\partial f(x|\lambda)/\partial x\}/f(x|\lambda) = \lambda A'(x) + C'(x)$$

and

$$\lambda = \{1/A'(x)\}[\{\partial f(x|\lambda)/\partial x\}/f(x|\lambda) - C'(x)].$$

Substituting this form of $\lambda$ in (3.1.1), assuming that we can write
$f'_G(x) = \int \{\partial f(x|\lambda)/\partial x\}\, dG(\lambda)$,

$$\delta_G(x) = \{1/A'(x)\}\{f'_G(x)/f_G(x) - C'(x)\}. \tag{3.4.6}$$

As an important special case recall Example 1.3.7, the normal
distribution where $A(x) = x/\sigma^2$, $B(\lambda) = -\lambda^2/2\sigma^2$, $C(x) = -x^2/2\sigma^2 -
\ln(\sigma\sqrt{2\pi})$, giving

$$\delta_G(x) = x + \sigma^2 f'_G(x)/f_G(x).$$

Now let $f_n(x)$ and $f'_n(x)$ be estimates of the density $f_G(x)$ and its
derivative $f'_G(x)$. These estimates are obtainable as explained in
section 3.4.6 as functions of the past $x$-observations without any
reference to the possible form of $G$. Consequently

$$\delta_n(x) = \{1/A'(x)\}\{f'_n(x)/f_n(x) - C'(x)\} \tag{3.4.7}$$

is a simple EBE of $\lambda$. The estimators $f_n(x)$ and $f'_n(x)$ are somewhat
more complicated than the straightforward estimators of probabilities
$p_G(x)$ in the discrete case. But, if they are consistent, asymptotic
optimality can be established as in section 3.2. Rates of convergence
of $E_nW(\delta_n)$ have been studied by Lin (1975), and found to be generally
slower than $1/n$.
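For the normal special case, (3.4.7) reduces to $\delta_n(x) = x + \sigma^2 f'_n(x)/f_n(x)$, and a minimal Python sketch of it is given here. The sketch assumes Gaussian kernel estimates of $f_G$ and $f'_G$ with a fixed window width `h`; the kernel, the value of `h`, the function names and the simulated data are assumptions made for illustration, and the estimators of section 3.4.6 may differ.

```python
import math
import random

def kernel_estimates(past, x, h):
    """Gaussian-kernel estimates of the marginal density f_G(x) and its derivative."""
    n = len(past)
    f = 0.0
    df = 0.0
    for xi in past:
        u = (x - xi) / h
        k = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
        f += k
        df += -u * k                      # derivative of the kernel w.r.t. u
    return f / (n * h), df / (n * h * h)

def normal_simple_ebe(past, x, sigma2=1.0, h=0.5):
    """delta_n(x) = x + sigma^2 f'_n(x) / f_n(x), the normal case of (3.4.7)."""
    f, df = kernel_estimates(past, x, h)
    return x + sigma2 * df / f

# Simulated past data (hypothetical): lam ~ N(0, 1), x | lam ~ N(lam, 1).
random.seed(1)
past = [random.gauss(0.0, 1.0) + random.gauss(0.0, 1.0) for _ in range(2000)]
for x in (-2.0, 0.0, 2.0):
    print(x, round(normal_simple_ebe(past, x), 3))
```

The window width governs the trade-off between bias and variability of the two kernel estimates, which is why the rate of convergence depends on how `h` is chosen as $n$ grows.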

3.4.4 General construction of simple EB estimators 

Two examples are useful for motivating the approach of this section. 
Example 3.4.1 $X = \text{Bin}(n, \lambda)$, so that

$$p(x|\lambda) = \binom{n}{x}\lambda^x(1 - \lambda)^{n-x} = \exp\left\{\ln\binom{n}{x} + x\ln\left(\frac{\lambda}{1 - \lambda}\right) + n\ln(1 - \lambda)\right\},$$

verifying that this distribution is also a member of the discrete
exponential family. To put it in the form (3.4.1) we have to
reparametrize setting $\theta = \lambda/(1 - \lambda)$, in which case we can obtain a
simple EBE of $\theta$, but not of $\lambda$. In other words we can obtain a direct
estimate of $E(\Theta|x)$, but not of $E(\Lambda|x)$.


Example 3.4.2 Let $f(x|\lambda) = (1/\lambda)e^{-x/\lambda}$, $x > 0$, and 0 otherwise. Then

$$\int_0^x f(u|\lambda)\, du = 1 - e^{-x/\lambda}$$

or

$$e^{-x/\lambda} = 1 - \int_0^x f(u|\lambda)\, du.$$

Substituting in (3.1.1) we obtain

$$\delta_G(x) = \{1 - F_G(x)\}/f_G(x). \tag{3.4.8}$$

In order to use (3.4.5) we would have to reparametrize to $\theta = 1/\lambda$ and
the Bayes estimator of $\theta$ is expressed in terms of $f_G$ and $f'_G$. The form
(3.4.8) has two advantages: one is that in some circumstances $\lambda$ rather
than $\theta$ may be the natural parameter to estimate; another is that
estimation of $F_G$ is simpler than estimation of $f'_G$.


The two examples suggest that it may be worthwhile to contemplate
more general operations on $F(x|\lambda)$ in order to construct EBEs.
Also, one need not necessarily consider estimating $\lambda$ directly. A
function of $\lambda$, or even a function of $\lambda$ and $x$ could be estimated.

Let $T_1$ and $T_2$ be operators on distribution functions such that
$T_rF(x|\lambda) = h_r(x, \lambda)f(x|\lambda)$, $r = 1, 2$. Then

$$\alpha(x) = \frac{T_1F_G(x)}{T_2F_G(x)} = \frac{E_{\Lambda|x}\{h_1(x, \Lambda)\}}{E_{\Lambda|x}\{h_2(x, \Lambda)\}} \tag{3.4.9}$$

where $E_{\Lambda|x}$ is expectation w.r.t. the posterior distribution of $\Lambda$ given $x$
(see Maritz and Lwin, 1975). The method of implementation of (3.4.9)
in simple EB estimation relies on the fact that $\alpha(x)$ can be estimated
from past observations because both $T_1F_G(x)$ and $T_2F_G(x)$ are just
properties of the marginal distribution.

There are two problems, one to do with the details of estimating
$\alpha(x)$, but as we shall show by example, the greater freedom of choice of
$T_r$ can make estimation of $\alpha(x)$ relatively straightforward. The other
problem can be discussed somewhat more effectively if we simplify
(3.4.9) by letting $T_2$ be $\partial/\partial x$ so that $T_2F(x|\lambda) = f(x|\lambda)$, i.e. $h_2(x, \lambda) = 1$.
Then

$$\alpha(x) = T_1F_G(x)/f_G(x) = E_{\Lambda|x}\{h_1(x, \Lambda)\}. \tag{3.4.10}$$

So, the estimate of $\alpha(x)$ provides an estimate of the posterior mean
of $h_1(x, \Lambda)$ whereas we actually want the posterior mean of $\Lambda$, i.e.
$E_{\Lambda|x}(\Lambda) = \delta_G(x)$.

A first approximation for $\delta_G(x)$ can be taken as the solution of

$$\alpha(x) = h_1\{x, \delta_G(x)\}$$

and an improved approximation may be obtained as the solution of

$$\alpha(x) = h_1\{x, \delta_G(x)\} + \tfrac{1}{2}\sigma_x^2\left[\partial^2 h_1(x, \lambda)/\partial\lambda^2\right]_{\lambda = \delta_G(x)}$$

where $\sigma_x^2$ is the posterior variance of $\Lambda$. In EB estimation $\alpha(x)$ is, of
course, replaced by its estimate $\hat\alpha(x)$.

Example 3.4.3 Suppose we let $T_1B(x) = \int_{-\infty}^{x} u\, dB(u)$, $T_2B(x) = B(x)$.
Then $\alpha(x) = \int_{-\infty}^{x} u\, dF_G(u)/F_G(x)$, the mean of the $X_G$ distribution
right-truncated at $x$. A natural estimator of $\alpha(x)$ is obtained from
order statistics $x_{(i)}$, $i = 1, 2, \ldots, n$ as

$$\hat\alpha(x_{(r)}) = \sum_{i=1}^{r} x_{(i)}/r.$$

This example shows that density estimation problems can be avoided,
although other problems, relating to the use of (3.4.10) may be
introduced.
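The estimator of Example 3.4.3 is just the running mean of the ordered past observations, as the following Python sketch shows. The function name `truncated_means` and the data are hypothetical.

```python
def truncated_means(past):
    """Return pairs (x_(r), alpha_hat(x_(r))), where alpha_hat(x_(r)) is the mean
    of the r smallest past observations."""
    out = []
    running = 0.0
    for r, x in enumerate(sorted(past), start=1):
        running += x
        out.append((x, running / r))
    return out

# Hypothetical past observations.
for x, a in truncated_means([3.0, 1.0, 2.0, 5.0]):
    print(x, a)
```

Only sorting and cumulative sums are involved, so no window width or other smoothing constant has to be chosen.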

Examples of other operators T l and T 2 applied to certain types of 
distributions are to be found in Rutherford and Krutchkoff (1969), 
Nichols and Tsokos (1972) and Cressie (1982). 
3.4.5 Smoothing of simple EB estimators

Most Bayes estimators, $\delta_G(x)$, are smooth functions of $x$ in some
obvious sense. Whether $X$ is continuous or discrete, $\delta_G(x)$ is usually
monotonic in $x$, and if $X$ is continuous $\delta_G(x)$ is usually also
differentiable w.r.t. $x$. For example, in the Poisson $X$-gamma $\Lambda$ case
$\delta_G(x) = (\beta + x)/(\alpha + 1)$ as in (1.3.7), and in the normal $X$-normal $\Lambda$
case $\delta_G(x) = (x/\sigma^2 + \mu_G/\sigma_G^2)/(1/\sigma^2 + 1/\sigma_G^2)$. But, as we have pointed
out before, $\delta_n(x)$ need not be smooth. For example, in Table 3.2, values
of $\delta_n(x)$ are shown for a particular set of data generated by a Poisson
$X$-gamma $\Lambda$ model. A plot of $\delta_n(x)$ against $x$ would produce an
irregular graph as in Fig. 1.1.

Since $\delta_G(x)$ is generally smooth it seems sensible, when a non-smooth
$\delta_n(x)$ has been obtained, to smooth it by fitting a straight line
or some other curve through the observed $\delta_n(x)$ values. Methods for such
direct smoothing of simple EBEs, $\delta_n(x)$, have received little attention,
and we limit this discussion to just one suggestion, put forward in
Maritz (1967) for discrete $X$ in the exponential family.

Suppose that $\delta_G(x) = A + Bx$. Then fitting of a straight line
$\delta_n^*(x) = A^* + B^*x$ would be indicated by the data. A method of
fitting is suggested by the following:

$$\sum_x\{\delta_n(x) - \delta_G(x)\}^2 p_G(x) = \sum_x\{\delta_n(x) - \delta_n^*(x)\}^2 p_G(x) + \sum_x\{\delta_n^*(x) - \delta_G(x)\}^2 p_G(x) + 2\sum_x(\delta_n(x) - A^* - B^*x)\{(A^* - A) + (B^* - B)x\}p_G(x). \tag{3.4.12}$$

By setting the third term in (3.4.12) equal to 0 we can ensure that
$W(\delta_n^*) \le W(\delta_n)$, and this can be done by letting $A^*$ and $B^*$ be the
solutions of

$$\begin{bmatrix} 1 & \sum_x x\,p_G(x) \\ \sum_x x\,p_G(x) & \sum_x x^2 p_G(x) \end{bmatrix}\begin{bmatrix} A^* \\ B^* \end{bmatrix} = \begin{bmatrix} \sum_x \delta_n(x)p_G(x) \\ \sum_x x\,\delta_n(x)p_G(x) \end{bmatrix}. \tag{3.4.13}$$

Clearly (3.4.13) cannot be applied directly in practice since the $p_G(x)$
values are not known. Either estimates using the observed frequencies
$f_n(x)$ can be used to replace the $p_G(x)$ values, or smoothed approximations
can be used in an iterative scheme. Begin by fitting a straight
line by eye or using estimates of $p_G(x)$ based on $f_n(x)$, $x = 0, 1, 2, \ldots$.
Let this line be $y_x^{(0)} = A^{(0)} + B^{(0)}x$. Then, using formula (3.4.2) in
reverse, calculate numbers $p^0(x)$ proportional to the estimated $p_G(x)$
according to

$$p^0(1) = p^0(0)\,y_0^{(0)}, \qquad p^0(2) = \tfrac{1}{2}\,p^0(0)\,y_0^{(0)}y_1^{(0)}, \quad \text{etc.} \tag{3.4.14}$$

Substituting $p^0(x)$ for $p_G(x)$ in (3.4.13), new values $A^{(1)}$, $B^{(1)}$ can be
calculated, etc.
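The iterative scheme can be sketched as follows for the Poisson case, where the recursion (3.4.14) reads $p^0(x+1) = p^0(x)\,y_x/(x+1)$. This is only an illustration under stated assumptions: the function names are hypothetical, the trial line is assumed to stay positive over the range of $x$ used, and the weights are normalized so that the top-left element of (3.4.13) equals 1.

```python
import numpy as np

def trial_probabilities(A, B, x_max):
    """Numbers p0(x), x = 0..x_max, from the line y_x = A + B*x via
    p0(x+1) = p0(x) * y_x / (x+1), normalized to sum to 1."""
    p = np.ones(x_max + 1)
    for x in range(x_max):
        p[x + 1] = p[x] * (A + B * x) / (x + 1)
    return p / p.sum()

def weighted_line_fit(delta_n, weights):
    """Solve the normal equations of (3.4.13) for (A*, B*)."""
    d = np.asarray(delta_n, dtype=float)
    w = np.asarray(weights, dtype=float)
    x = np.arange(d.size, dtype=float)
    lhs = np.array([[w.sum(), (w * x).sum()],
                    [(w * x).sum(), (w * x**2).sum()]])
    rhs = np.array([(w * d).sum(), (w * x * d).sum()])
    A_star, B_star = np.linalg.solve(lhs, rhs)
    return A_star, B_star

def smooth_simple_ebe(delta_n, A0, B0, n_iter=1):
    """Alternate between (3.4.14) and (3.4.13), starting from the line A0 + B0*x."""
    A, B = A0, B0
    for _ in range(n_iter):
        p0 = trial_probabilities(A, B, len(delta_n) - 1)
        A, B = weighted_line_fit(delta_n, p0)
    return A, B
```

A call such as `smooth_simple_ebe(delta_n, 2.65, 0.465)` performs one iteration from an eye-fitted starting line; how many iterations are worthwhile is left open here.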

Another method of fitting is indicated by (3.4.14). Dividing the trial
$p^0(x)$ by a suitable factor gives fitted probabilities $p^{(0)}(x)$, estimates of the $p_G(x)$.
The likelihood of the past observations can then be
calculated as

$$L = \prod_{i=1}^{n} p^{(0)}(x_i), \tag{3.4.15}$$

a function of $A^{(0)}$ and $B^{(0)}$. Applying the maximum likelihood
principle, $A^{(0)}$ and $B^{(0)}$ are adjusted to maximize $L$.

Both methods described for fitting a straight line can be extended in
an obvious manner to fit higher-order polynomials.

An important contribution to the idea of smoothing of simple (or
any EB) estimators was made by van Houwelingen (1977) for the
exponential family of discrete distributions. Starting with an arbitrary
estimator $d(x)$, a monotonized version $d^*(x)$ of $d(x)$ is constructed
having the property that $R_{d^*}(\lambda) \le R_d(\lambda)$ for all $\lambda$, i.e. that $d^*$ dominates
$d$. Following van Houwelingen, let $d(x)$ be a randomized estimator
represented by a distribution function $D(a, x)$. Thus if $x$ is observed, an
estimate $a$ generated by $D(a, x)$ is made.

The construction of the estimator $d^*$ is according to the following
steps:

$$P(x\mid\lambda) = \sum_{y=0}^{x} p(y\mid\lambda); \qquad P(-1\mid\lambda) = 0,$$

where $p(x\mid\lambda)$ is given by (3.4.1);

$$\mathscr{D}(a) = \sum_x D(a,x)\,p(x\mid a);$$

$$D^*(a,x) = \begin{cases} 0 & \text{if } \mathscr{D}(a) < P(x-1\mid a) \\ \{\mathscr{D}(a) - P(x-1\mid a)\}/p(x\mid a) & \text{if } P(x-1\mid a) \le \mathscr{D}(a) < P(x\mid a) \\ 1 & \text{if } P(x\mid a) \le \mathscr{D}(a); \end{cases}$$

$$d^*(x) = \int_{\Lambda} a\,dD^*(a,x).$$

van Houwelingen (1977) shows that $d^*(x)$ is monotonic in $x$ and
dominates $d(x)$. Particular examples of the application of this
procedure to simple EBEs suggest that the improvement in performance
of $d^*(x)$ relative to $\delta_n(x)$ can be quite dramatic. There is, of
course, no guarantee that $d^*(x)$ will necessarily be better than some
other smooth EBE, but the theoretical, and practical, advantage of
van Houwelingen's method is that $d^*(x)$ is guaranteed to be better
than the simple EBE $\delta_n(x)$. An example is given in section 3.7.
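The construction can be tried numerically. The sketch below is not van Houwelingen's own algorithm but a grid approximation under several assumptions: the data distribution is Poisson, the starting estimator is non-randomized (so $D(a,x) = 1$ when $d(x) \le a$ and 0 otherwise), the support is truncated at the length of the input, and $d^*(x)$ is computed as the integral of $1 - D^*(a,x)$ over $0 < a < a_{\max}$, which equals $\int a\,dD^*(a,x)$ when all estimates lie in that interval. The names `poisson_pmf`, `monotonize`, `a_max` and `step` are hypothetical.

```python
import numpy as np

def poisson_pmf(a_grid, x_max):
    """Matrix with entry [i, x] = p(x | a_i) for a Poisson data distribution."""
    x = np.arange(x_max + 1)
    log_fact = np.concatenate(([0.0], np.cumsum(np.log(np.arange(1, x_max + 1)))))
    a = np.asarray(a_grid, dtype=float)[:, None]
    return np.exp(x * np.log(a) - a - log_fact)

def monotonize(d, a_max, step=0.01):
    """Grid approximation to d*(x) for a non-randomized estimator d(0), ..., d(x_max)."""
    d = np.asarray(d, dtype=float)
    a = np.arange(step / 2, a_max, step)              # midpoints of a grid on (0, a_max)
    p = np.maximum(poisson_pmf(a, d.size - 1), 1e-300)  # floor avoids division by zero
    cdf = np.cumsum(p, axis=1)                        # P(x | a)
    cdf_prev = np.hstack([np.zeros((a.size, 1)), cdf[:, :-1]])  # P(x-1 | a)
    # script-D(a): probability, under parameter a, that the estimate d(X) is <= a
    D_bar = (p * (d[None, :] <= a[:, None])).sum(axis=1)
    D_star = np.clip((D_bar[:, None] - cdf_prev) / p, 0.0, 1.0)
    return step * (1.0 - D_star).sum(axis=0)          # integral of 1 - D*(a, x) over a
```

An estimator that is already non-decreasing in $x$ is returned essentially unchanged, while an irregular one is replaced by a non-decreasing sequence. Values far out in the truncated tail are affected by the truncation and by rounding and should not be relied on.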


3.4.6 Density estimation 


Let $x_1, x_2, \ldots, x_n$ be independent observations on a continuous
random variable $X$ with distribution function $F(x)$ and density
function $f(x)$. The kernel density estimate of $f(x)$ proposed by Parzen
(1962) is $f_n(x)$ given by

$$f_n(x) = \frac{1}{nh(n)}\sum_{j=1}^{n} K\left\{\frac{x - x_j}{h(n)}\right\},$$

where $K$ is typically chosen to be one of the standard density
functions, for example, a $N(0,1)$ density. The divisor $h(n)$ is called the
window width, and for consistent estimation we must have $h(n) \to 0$
and $nh(n) \to \infty$ as $n \to \infty$.
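A direct transcription of the kernel estimate with a $N(0,1)$ kernel is given below as a sketch. The window-width rule `window_width` is a hypothetical choice, included only because any $h(n) = c\,n^{-\gamma}$ with $0 < \gamma < 1$ satisfies the two consistency conditions; it is not a recommendation from the text.

```python
import numpy as np

def kernel_density(x, data, h):
    """Parzen estimate f_n(x) = (1/(n h)) * sum_j K((x - x_j)/h), K the N(0,1) density."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    u = (x[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return K.sum(axis=1) / (data.size * h)

def window_width(n, c=1.0, gamma=0.2):
    """Hypothetical rule h(n) = c * n**(-gamma); h -> 0 and n*h -> infinity for 0 < gamma < 1."""
    return c * n ** (-gamma)

# Illustrative data only.
data = [0.0, 1.0, 2.5]
grid = np.linspace(-10.0, 12.0, 4401)
f_hat = kernel_density(grid, data, window_width(len(data)))
```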

Density estimation is an important topic in several branches of 
statistics and much further useful information will be found in 
Silverman (1986). 


3.5 EB estimation through estimating G 

3.5.1 Introduction 

A good deal of the literature on EB estimation has been devoted to
simple EBEs, i.e. where no knowledge of $G$ is assumed. In some cases it
would seem perfectly reasonable to suppose that the type of $G$ is
known, i.e. the family $\mathscr{G}$ of distributions to which $G$ belongs. Then, if $G$
is indexed by a low-dimensional parameter vector, the marginal
$X$-distribution $F_G(x)$ is similarly indexed, its form is known, and
estimation of $G$ becomes a standard problem of parameter
estimation.

Even if it is agreed that knowledge of the exact parametric form of $G$
will rarely be available, there is value in examining such parametric $G$
EBEs. As we have remarked, they have a place of their own, but aside
from that, knowledge of the form of $G$ can perhaps be regarded as the
most favourable state in which one can be. It is, then, useful to
compare the performance of other EBEs with their parametric $G$
counterparts.

In earlier discussions it has also emerged that Bayes estimators
could be approximated by choosing a $G^*$ belonging to a family $\mathscr{G}^*$;
see sections 1.12, 2.7, 3.3.2. A particular family of some interest is
that comprising step-functions with $k$ steps. One of the advantages of
obtaining approximate EBEs of this sort is that they are automatically
smooth, and some of the problems of simple EBEs are avoided.

3.5.2 G belonging to a known parametric family 

Suppose that $G(\lambda) = G(\lambda; \alpha, \beta, \ldots)$ is of known form. Then the mixed
distribution is $F_G(x) = F(x; \alpha, \beta, \ldots)$. The past observations $x_1, x_2, \ldots, x_n$
now enable one to calculate estimates $\hat\alpha, \hat\beta, \ldots$ of the parameters
$\alpha, \beta, \ldots$ by any one of the standard methods of estimation. The Bayes
estimate can be expressed as $\delta_G(x) = \delta(x; \alpha, \beta, \ldots)$ and the EBE is
$\hat\delta_G(x) = \delta(x; \hat\alpha, \hat\beta, \ldots)$.

In most standard conditions the usual results for estimation will
apply to $\hat\delta_G$. If $\hat\alpha, \hat\beta$ are consistent for $\alpha, \beta$ in the sense that $\hat\alpha, \hat\beta \to \alpha, \beta$ (P)
as $n \to \infty$, we have asymptotic optimality in the sense of section 3.2.
We shall also have $E_n\{\hat\delta_G(x) - \delta_G(x)\}^2 = O(1/n)$ in most standard
situations, so that $E_nW(\hat\delta_G) = W(\delta_G) + O(1/n)$, with possibly some
moderate restrictions on $\hat\delta_G$ as in section 3.2.

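Example 3.5.1 Let $X \stackrel{d}{=} U(\lambda - 1, \lambda + 1)$ and $\Lambda \stackrel{d}{=} U(A - 1, A + 1)$.
Then $\delta_G(x) = (x + A)/2$. The marginal $X$-distribution has

$$f_G(x) = \begin{cases} (\tfrac{1}{4})(x - A + 2), & A - 2 \le x \le A \\ (\tfrac{1}{4})(A - x + 2), & A \le x \le A + 2, \end{cases}$$

i.e. a triangular density centred at $A$. An obvious estimator of $A$
is $\hat A = \bar x$, the sample mean, which is unbiased with variance $2/(3n)$. So
$\hat\delta_G(x) = (x + \hat A)/2$, and

$$E_n\int_{A-2}^{A+2}\{\hat\delta_G(x) - \delta_G(x)\}^2 f_G(x)\,dx = E_n\,\tfrac{1}{4}(\hat A - A)^2 = 1/(6n),$$

so $E_nW(\hat\delta_G) = W(\delta_G) + 1/(6n)$.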
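The arithmetic of Example 3.5.1 can be checked numerically: the variance of the triangular marginal density is $2/3$, and a quarter of the variance of the sample mean gives the excess risk. The sketch below uses a simple Riemann sum; the function name and grid size are illustrative assumptions.

```python
import numpy as np

def excess_risk_uniform_example(n, A=0.0, grid_size=400001):
    """Excess risk E_n W - W(delta_G) = var(x_bar)/4 for Example 3.5.1,
    with var(X_G) obtained by numerical integration of the triangular density."""
    x = np.linspace(A - 2.0, A + 2.0, grid_size)
    f = (2.0 - np.abs(x - A)) / 4.0          # triangular density centred at A
    dx = x[1] - x[0]
    var_x = np.sum((x - A) ** 2 * f) * dx    # close to 2/3
    return var_x / (4.0 * n)

excess_10 = excess_risk_uniform_example(10)
```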
In more realistic examples the calculation of $E_nW(\hat\delta_G)$ is less
straightforward, and compact formulae for $E_nW(\hat\delta_G)$ appear to be
obtainable only as approximations for large $n$.


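Example 3.5.2 $X \stackrel{d}{=} N(\lambda, 1)$, $\Lambda \stackrel{d}{=} N(\mu_G, \sigma_G^2)$.

(i) $\sigma_G^2$ known: $X_G \stackrel{d}{=} N(\mu_G, \sigma_G^2 + 1)$ and we estimate $\mu_G$ by the
sample mean $\bar x$ of past observations. Then

$$\hat\delta_G(x) = (x + \bar x/\sigma_G^2)/(1 + 1/\sigma_G^2) \tag{3.5.1}$$

and

$$E_n\int\left\{\frac{x + \bar x/\sigma_G^2}{1 + 1/\sigma_G^2} - \frac{x + \mu_G/\sigma_G^2}{1 + 1/\sigma_G^2}\right\}^2 f_G(x)\,dx = \frac{1}{n(\sigma_G^2 + 1)}.$$

(ii) $\sigma_G^2$ unknown: Since $\mathrm{var}(X_G) = \sigma_G^2 + 1$ we estimate $\sigma_G^2$ by
$\hat\sigma_G^2 = \max(0, s^2 - 1)$, where $s^2$ is the sample variance of past observations. Now

$$E_n\int\left\{\frac{x + \bar x/\hat\sigma_G^2}{1 + 1/\hat\sigma_G^2} - \frac{x + \mu_G/\sigma_G^2}{1 + 1/\sigma_G^2}\right\}^2 f_G(x)\,dx = E_n\left\{\frac{(\bar x - \mu_G)^2}{(1 + \hat\sigma_G^2)^2}\right\} + E_n\left\{\frac{(\hat\sigma_G^2 - \sigma_G^2)^2}{(1 + \sigma_G^2)(1 + \hat\sigma_G^2)^2}\right\}. \tag{3.5.2}$$

For small $n$ evaluation of the r.h.s. of (3.5.2) is numerically fairly
straightforward, using the fact that $(n - 1)s^2/(1 + \sigma_G^2) \stackrel{d}{=} \chi^2_{n-1}$ and
observing the condition $\hat\sigma_G^2 = \max(0, s^2 - 1)$. Some results are given in
section 3.7.3. For $n \to \infty$, $P(s^2 - 1 < 0) \to 0$ and one can evaluate
(3.5.2) approximately by replacing $1 + \hat\sigma_G^2$ by $s^2$ and ignoring the
truncation. Then $E_n(1/s^2) = (n - 1)/\{(n - 3)(1 + \sigma_G^2)\}$ and $E_n(1/s^4) =
(n - 1)^2/\{(n - 3)(n - 5)(1 + \sigma_G^2)^2\}$, giving

$$E_nW(\hat\delta_G) \simeq \left\{\frac{1}{n}\,\frac{(n - 1)^2}{(n - 3)(n - 5)} + \frac{2n + 6}{(n - 3)(n - 5)}\right\}\Big/(1 + \sigma_G^2) + W(\delta_G). \tag{3.5.3}$$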
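Case (ii) of Example 3.5.2 can be sketched in a few lines. The estimator is written in the equivalent shrinkage form $\bar x + \{\hat\sigma_G^2/(1 + \hat\sigma_G^2)\}(x - \bar x)$, which avoids dividing by $\hat\sigma_G^2$ when the truncation at zero applies. The function names are hypothetical, and the second function only evaluates the large-$n$ excess term of (3.5.3), i.e. $E_nW(\hat\delta_G) - W(\delta_G)$.

```python
import numpy as np

def normal_ebe(x, past_x):
    """EBE for X ~ N(lambda, 1), Lambda ~ N(mu_G, sigma_G^2), both prior parameters unknown."""
    past_x = np.asarray(past_x, dtype=float)
    x_bar = past_x.mean()
    s2 = past_x.var(ddof=1)                    # sample variance of past observations
    sigma2_hat = max(0.0, s2 - 1.0)            # truncated estimate of sigma_G^2
    shrink = sigma2_hat / (1.0 + sigma2_hat)   # same as 1/(1 + 1/sigma2_hat); 0 if sigma2_hat = 0
    return x_bar + shrink * (x - x_bar)

def large_n_excess_risk(n, sigma2_G):
    """Excess term of (3.5.3); requires n > 5 for the moments of 1/s^2 and 1/s^4 to exist."""
    if n <= 5:
        raise ValueError("the approximation needs n > 5")
    first = (n - 1) ** 2 / (n * (n - 3) * (n - 5))
    second = (2 * n + 6) / ((n - 3) * (n - 5))
    return (first + second) / (1.0 + sigma2_G)
```

For instance, `large_n_excess_risk(20, 1.0)` is about 0.126, to be added to $W(\delta_G) = 0.5$ for that prior.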


In Examples 3.5.1 and 3.5.2 estimation of the parameters of
$F(x; \alpha, \beta, \ldots)$ is straightforward and calculation of $E_nW(\hat\delta_G)$ fairly easy.
Generally, estimation of $\alpha, \beta, \ldots$ will be by the method of
maximum likelihood or some other standard procedure. If the ML
estimates of $\alpha, \beta, \ldots$ are $\hat\alpha, \hat\beta, \ldots$ then $\delta(x; \hat\alpha, \hat\beta, \ldots)$ is the MLE of $\delta_G$. For
large $n$ one can express $E_nW(\hat\delta_G)$ approximately in terms of the
elements of the information matrix of estimating $\alpha, \beta, \ldots$.
3.5.3 Approximation of G by a parametric distribution

Suppose that the form of $G$ is unknown and that it is decided to
approximate it by a member of a family $\mathscr{G}^*$ of distributions.
According to the discussion of section 2.7.1 the best approximating
$G^*(\lambda; \alpha, \beta, \ldots)$ is that member of $\mathscr{G}^*$ for which

$$\int \ln f_{G^*}(x; \alpha, \beta, \ldots)\,dF_G(x)$$

is maximized. The empirical version of $G^*$ is $G^*(\lambda; \hat\alpha, \hat\beta, \ldots)$, where
$\hat\alpha, \hat\beta, \ldots$ are obtained by maximizing

$$\sum_{i=1}^{n} \ln f_{G^*}(x_i; a, b, \ldots)$$

w.r.t. $a, b, \ldots$. In other words, we act as if the marginal $X$-distribution
has density $f_{G^*}(x; \alpha, \beta, \ldots)$ and $\alpha, \beta, \ldots$ are estimated by the ML
method. The approximate EBE $\hat\delta_{G^*}$ is given by (3.1.1) with $G$ replaced
by $G^*(\lambda; \hat\alpha, \hat\beta, \ldots)$; it can be written $\hat\delta_{G^*}(x) = \delta^*(x; \hat\alpha, \hat\beta, \ldots)$.

As $n \to \infty$, $\delta^*(x; \hat\alpha, \hat\beta, \ldots) \to \delta^*(x; \alpha, \beta, \ldots)$ and the goodness of
$\delta^*(x; \hat\alpha, \hat\beta, \ldots)$ will depend on the goodness of the approximation of
$\delta_G(x)$ by $\delta^*(x; \alpha, \beta, \ldots)$. The questions of $\Delta$-asymptotic optimality and
robustness discussed in sections 3.2 and 3.3 are relevant here. As we
have seen in section 3.5.1, calculation of $E_nW(\hat\delta_G)$ can be difficult, and
the same applies here. It is possible to obtain large sample formulae
for $E_nW(\hat\delta_{G^*})$.

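Let us consider the case of $G^*$ depending on only one parameter, i.e.
$G^*(\lambda; \alpha)$. Modification for more than one parameter is fairly straightforward.
The estimate of $\alpha$ is the solution $a = \hat\alpha$ of

$$S(\mathbf{x}; a) = \sum_{i} \frac{\partial \ln f_{G^*}(x_i; a)}{\partial a} = 0. \tag{3.5.4}$$

For $n$ large,

$$\mathrm{var}(\hat\alpha) \simeq \frac{\mathrm{var}\{S(\mathbf{X}; \alpha)\}}{\left[\partial E\{S(\mathbf{X}; a)\}/\partial a\right]^2_{a=\alpha}}, \tag{3.5.5}$$

where $\alpha$ is the solution of

$$\int \frac{\partial \ln f_{G^*}(x; a)}{\partial a}\,f_G(x)\,dx = 0.$$

For a justification of (3.5.5) see, for example, Maritz (1981). Now,
write $h^*(x; a) = \partial \ln f_{G^*}(x; a)/\partial a$. Then

$$\mathrm{var}\{S(\mathbf{X}; \alpha)\} = n\left[\int \{h^*(x; \alpha)\}^2 f_G(x)\,dx - \left\{\int h^*(x; \alpha) f_G(x)\,dx\right\}^2\right]$$

and

$$\left[\frac{\partial E\{S(\mathbf{X}; a)\}}{\partial a}\right]_{a=\alpha} = n\int \left[\frac{\partial h^*(x; a)}{\partial a}\right]_{a=\alpha} f_G(x)\,dx. \tag{3.5.6}$$

Finally,

$$\mathrm{var}\{\delta^*(x; \hat\alpha)\} \simeq \left[\frac{\partial \delta^*(x; a)}{\partial a}\right]^2_{a=\alpha} \mathrm{var}(\hat\alpha), \tag{3.5.7}$$

which can be used to compute the value of $E_nW(\hat\delta_{G^*})$ approximately.
From the results given above we obtain

$$E_nW(\hat\delta_{G^*}) = W(\delta_G) + \Delta + O(1/n).$$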


3.5.4 Step-function approximation of G 

The idea of approximating $G$ by a step-function is discussed in
section 2.7. Here we shall consider only the approximating $G^*(\lambda) = G_k(\lambda)$
having $k$ steps of equal size $1/k$ at $\lambda_1, \lambda_2, \ldots, \lambda_k$. A method of
determining $G_k$ is described in section 2.9.

When using G k in EB estimation it seems desirable to keep k fairly 
small especially for small n. Numerical examples have indicated that 
very small values like k = 3 can give quite adequate approximation of 
the Bayes estimator. This is in accord with the discussion of 
robustness of Bayes and EB estimators in section 3.3. In particular, if 
G is symmetrical it is clearly possible to have exact agreement of the 
first three moments of G and G k . 

The approximate Bayes estimator given by $G_k$ is

$$\psi_k(x) = \sum_{j=1}^{k} \lambda_j f(x\mid\lambda_j)\Big/\sum_{j=1}^{k} f(x\mid\lambda_j) \tag{3.5.8}$$

and its goodness is measured by $W(\psi_k) = W(\delta_G) + \Delta_k$. As we have
seen before, we can obtain a sequence $\Delta_k \to 0$ as $k \to \infty$. Numerical
values of $W(\psi_k)$ for certain special cases are given in section 3.7.

In EB applications $\lambda_1, \lambda_2, \ldots, \lambda_k$ are estimated from the past data.
This can be done, for example, by the distance-minimizing method
which in effect treats the $x$-values as having been generated by $F_{G_k}(x)$
and uses the ML method. Other methods implying the use of other
distance measures can be used, for example the method of moments,
which is easily implemented when $k$ is small. According to the pseudo-ML
method $\hat\lambda_1, \hat\lambda_2, \ldots, \hat\lambda_k$ are obtained by maximizing
$\sum_{i=1}^{n} \ln\{(1/k)\sum_{j=1}^{k} f(x_i\mid\lambda_j)\}$ w.r.t. $\lambda_1, \lambda_2, \ldots, \lambda_k$. The EBE, $\hat\psi_k(x)$, is
given by (3.5.8) with $\hat\lambda_j$ replacing $\lambda_j$. We can also write

$$\sum_{i=1}^{n} \ln\left\{\frac{1}{k}\sum_{j=1}^{k} f(x_i\mid\lambda_j)\right\} = n\int \ln\left\{\frac{1}{k}\sum_{j=1}^{k} f(x\mid\lambda_j)\right\} dF_n(x) \tag{3.5.9}$$

and note that $F_n(x) \to F_G(x)$, (P), for every $x$ as $n \to \infty$. Hence, by the
arguments of section 2.7, $\hat\lambda_j \to \lambda_j$, (P), as $n \to \infty$. In other words, $\hat\lambda_j$
is a consistent estimate of $\lambda_j$, and $\hat\psi_k(x)$ is a consistent estimate of
$\psi_k(x)$. Following the arguments in section 3.5.2 we have

$$E_nW(\hat\psi_k) = W(\delta_G) + \Delta_k + O(1/n).$$


3.6 Linear EB estimation 

An outline of linear Bayes estimation is given in section 1.12.1 and
(1.12.2) summarizes the method of calculating a linear Bayes estimator.
In the l.h.s. matrix of (1.12.2) we have elements

$$E(X_G^r) = \int E(X^r\mid\lambda)\,dG(\lambda), \qquad r = 1, 2,$$

both of which can obviously be estimated simply by the first two
sample moments of the observations $x_1, x_2, \ldots, x_n$.

In general estimation of the elements of the r.h.s. of (1.12.2) is more
difficult, although it can be easy in special cases. Let us suppose we can
find functions $U_1(X)$, $U_2(X)$ such that

$$\int U_1(x) f(x\mid\lambda)\,dx = \lambda, \qquad \int U_2(x) f(x\mid\lambda)\,dx = \mathscr{E}(\lambda), \quad \text{where } \mathscr{E}(\lambda) = \lambda E(X\mid\lambda). \tag{3.6.1}$$

Then the r.h.s. elements of (1.12.2) are estimated by

$$\sum_{i=1}^{n} U_r(x_i)/n, \qquad r = 1, 2.$$

Clearly, solution of the integral equations (3.6.1) is non-trivial, but
certain forms of $f(x\mid\lambda)$ do lead to straightforward estimation. Very
often the distribution $f(x\mid\lambda)$ can be parametrized so that $E(X\mid\lambda) = \lambda$,
giving $\mathscr{E}(\lambda) = \lambda^2$, so that the choice of $U_1$ and $U_2$ is sometimes
obvious.

Example 3.6.1 $X \stackrel{d}{=}$ Poisson($\lambda$). In this case

$$\mathscr{E}(\lambda) = \lambda E(X\mid\lambda) = \lambda^2$$

and we can take $U_1(X) = X$, $U_2(X) = X^2 - X$. Straightforward calculations
show that the linear Bayes estimator is

$$\tilde\delta_G(x) = [\{E(X_G)\}^2 + x\{\mathrm{var}(X_G) - E(X_G)\}]/\mathrm{var}(X_G), \tag{3.6.2}$$

with the moments of $X_G$ estimated directly from the observations.

3.7 EB estimation for special univariate distributions: one current 
observation 

In each of the examples of this section the natural conjugate 
distribution is used to generate numerical data, and in the parametric 
G EB estimation, for the most part only the family of conjugate priors 
is considered. This is done partly for simplicity, and partly because the 
conjugate families are thought to be flexible enough to provide good 
approximations to the actual G in many applications. 

3.7.1 The Poisson distribution 

We begin this section with an example, based on the Poisson
distribution, illustrating some of the methods of constructing EBEs,
and the calculation of $W(\cdot)$ values. Table 3.2 shows in the second
column observed frequencies $f_n(x)$ generated by a Poisson distribution
mixed by a gamma($\alpha, \beta$) prior $G$, with p.d.f. given as in
Example 1.3.2. For these data $\bar x = 5.00$, $s^2 = m_2 = 9.12$.

(a) Parametric $G$, $\mathscr{G}$ known to be gamma($\alpha, \beta$): estimating $\alpha, \beta$ by
the method of moments using $E(X_G) = \beta/\alpha$, $\mathrm{var}(X_G) = \beta/\alpha + \beta/\alpha^2$ we
obtain $\tilde\alpha = 1.214$, $\tilde\beta = 6.068$.

(b) Parametric $G$, $\mathscr{G}$ known: ML estimation of $\alpha, \beta$:

$$\hat\alpha = 1.310, \qquad \hat\beta = 6.548, \qquad \hat\delta_G(x) = 2.835 + 0.433x.$$
Table 3.2 Observed frequencies $f_n(x)$ generated by the Poisson-gamma model
when $G$ is gamma(2,10). Columns headed (c), (d), (f), (g) are EBEs
derived by the methods given in the text

  x    f_n(x)   p_G(x)   (c) Moments    (d)      (f)      (g)    delta_G(x)
  0      -      0.017       3.46        3.00     1.34     3.42     3.33
  1      3      0.058       3.51        4.00     2.20     3.42     3.67
  2      8      0.106       3.60        3.33     2.80     3.42     4.00
  3     10      0.141       3.78        0.73     3.33     3.42     4.33
  4      2      0.153       4.12       18.33     3.66     5.00     4.67
  5     11      0.143       4.66        2.00     4.75     5.00     5.00
  6      4      0.119       5.40        5.60     5.62     5.00     5.33
  7      4      0.091       6.15        0.00     6.40     5.46     5.67
  8      -      0.064       6.74        9.00     7.88     6.99     6.00
  9      1      0.043       7.12       10.00     8.75     7.75     6.33
 10      2      0.027       7.30       14.67     9.29     7.95     6.67
 11      4      0.016       7.41        0.00     9.79     7.95     7.00
 12      -      0.010       7.45       13.00    10.16     8.18     7.33
 13      1      0.005       7.48        0.00    10.76     8.18     7.67
 14      -      0.003       7.50        0.00    11.58     8.18     8.00
 15      -      0.002       7.50        0.00    12.32     8.18     8.33
 16      -      0.001       7.50        0.00    12.84     8.18     8.67
 W(.)                       1.90       41.04     3.38     2.08     1.67


(c) Step-function approximation of $G$: three methods were employed
for estimating $\lambda_1, \lambda_2, \lambda_3$ in $G_3$. The ML method requires
maximizing

$$L_n(\lambda) = \sum_x f_n(x)\,\ln\left\{\frac{1}{3}\sum_{j=1}^{3} \exp(-\lambda_j)\lambda_j^x/x!\right\}$$

w.r.t. $\lambda_1 < \lambda_2 < \lambda_3$. The resulting estimates are

$$\hat\lambda_1 = 3.57, \quad \hat\lambda_2 = 3.58, \quad \hat\lambda_3 = 8.14.$$

A minimum $\chi^2$ method described in Maritz (1967), using grouping
of the frequencies in Table 3.2, gave

$$\hat\lambda_1 = 3.3, \quad \hat\lambda_2 = 3.6, \quad \hat\lambda_3 = 7.5.$$

The method of moments was used, choosing the $\lambda$'s so as to equate
the first and second moments of the observed and fitted $X_G$-distributions
while minimizing the absolute difference of third
moments:

$$\hat\lambda_1 = 3.50, \quad \hat\lambda_2 = 3.59, \quad \hat\lambda_3 = 7.87.$$

The results in column (c) of Table 3.2 are based on the minimum $\chi^2$
estimates.
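Given the three step positions, the estimator (3.5.8) is a weighted mean of the $\lambda_j$ with weights proportional to the Poisson probabilities of the observed $x$. The sketch below evaluates it with the minimum $\chi^2$ estimates 3.3, 3.6 and 7.5; the function name is hypothetical, and the weights are computed on the log scale purely as a numerical precaution.

```python
import numpy as np

def step_function_estimator(x, lambdas):
    """psi_k(x) of (3.5.8) for Poisson data and equal steps at the given lambdas."""
    lam = np.asarray(lambdas, dtype=float)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # log of exp(-lambda_j) * lambda_j**x; the factor 1/x! cancels in the ratio
    log_w = x[:, None] * np.log(lam)[None, :] - lam[None, :]
    log_w -= log_w.max(axis=1, keepdims=True)
    w = np.exp(log_w)
    return (w * lam).sum(axis=1) / w.sum(axis=1)

psi = step_function_estimator(np.arange(17), [3.3, 3.6, 7.5])
```

The values at $x = 0$, $8$ and $16$ come out close to the entries 3.46, 6.74 and 7.50 of column (c) in Table 3.2; small differences at other $x$ can arise because the estimates are quoted to one decimal place.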

(d) The simple EBE

$$\delta_n(x) = (x + 1)f_n(x + 1)/\{1 + f_n(x)\}.$$
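Applied to the frequencies of Table 3.2 (with a dash read as zero), the simple EBE can be computed directly. The function name below is hypothetical, and the frequency beyond the last tabulated $x$ is taken to be zero.

```python
import numpy as np

def simple_ebe_poisson(freq):
    """delta_n(x) = (x + 1) f_n(x + 1) / {1 + f_n(x)} for x = 0, 1, ..., len(freq) - 1."""
    f = np.asarray(freq, dtype=float)
    f_next = np.append(f[1:], 0.0)     # f_n(x + 1), zero beyond the table
    x = np.arange(f.size)
    return (x + 1) * f_next / (1.0 + f)

# Observed frequencies f_n(x), x = 0, ..., 16, from Table 3.2.
freq = [0, 3, 8, 10, 2, 11, 4, 4, 0, 1, 2, 4, 0, 1, 0, 0, 0]
delta_n = simple_ebe_poisson(freq)
```

The result reproduces the irregular values of column (d), for example 3.00, 4.00, 3.33, 0.73 and 18.33 for $x = 0, \ldots, 4$.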

(e) A smoothed version of (d), fitting a straight line by the
weighted least squares method motivated by (3.4.13). Starting with
an eye-fitted line

$$y^{(0)} = 2.65 + 0.465x$$

one iteration gave the result

$$y^{(1)} = 2.15 + 0.760x.$$

(f) A monotonized version of $\delta_n(x)$ according to the method of van
Houwelingen (1977) described in section 3.4.5.

(g) Monotonic ordinates $\eta_x$, $x = 0, 1, 2, \ldots$, fitted by the ML method
suggested by (3.4.15): expressing the $p^0(x_i)$ values in terms of trial
ordinates $\eta_x$ and inspecting the partial derivatives of $\ln L$ w.r.t. the $\eta$'s
shows that $\eta_x$ should be constant for all $x$ such that $\sum_{r=0}^{x} f_n(r) = 0$ and
$\sum_{r=x+1}^{\infty} f_n(r) = 0$. Then maximize $\ln L$ subject to $\eta_0 \le \eta_1 \le \cdots$.

(h) Linear EB: using (3.6.2) the linear EBE is

$$\delta(w_0, w_1) = 2.741 + 0.452x,$$

which is the same as $\tilde\delta_G(x)$ from (a), as it should be in this special case.

Table 3.3 Expected losses $W(\cdot)$ for EB estimators based
on the data of Table 3.2 and for Bayes and best non-Bayes
estimators

  Method                                        W(.)
  Parametric G: moments                          1.77
  Parametric G: ML                               1.74
  G_3: moments                                   1.90
  delta_n                                       41.04
  delta_n smoothed: straight line                1.74
  delta_n monotonized (van Houwelingen)          3.38
  Non-decreasing ordinates: ML                   2.08
  Best non-Bayes: T = x                          5.00
  Bayes                                          1.67

The values of some of the estimators are listed in Table 3.2.
Unlisted values for other finite approximation types are close to those
tabulated, while other unlisted values are readily calculated. The table
also shows $p_G(x)$ and $\delta_G(x)$, to be used in computing $W(\cdot)$ values. The
values of $W$ for various estimators are given in Table 3.3.

All of the EBEs except $\delta_n$ have much better performance than $T$;
indeed their $W(\cdot)$ values are remarkably close to $W(\delta_G)$.

Obtaining analytic results for $E_nW(\mathrm{EB})$, where EB here stands for
any EBE, in particular any one of (a)-(g), seems virtually impossible,
except as approximations for $n$ large. Even then formulae would be
excessively unwieldy. One needs for each $x$ an expression for
$\mathrm{var}_n(\mathrm{EB})$, in terms of $x$, and then a summation of such terms after
multiplication by $p_G(x)$. For example, if we take estimator (a), the
gamma $G$ case, we can write

$$\delta_G(x) = \frac{\mu_1'^2 + x(\mu_2' - \mu_1'^2 - \mu_1')}{\mu_2' - \mu_1'^2}, \tag{3.7.1}$$

where $\mu_r' = E(X_G^r)$, $r = 1, 2$. The EBE $\tilde\delta_G(x)$ is given by replacing $\mu_r'$ by
$m_r'$, where $m_r'$ is the $r$th sample moment of $X_G$. Obviously, an expression
in terms of $\alpha, \beta, n, x$ can be written down for $\mathrm{var}\{\tilde\delta_G(x)\}$, using
standard methods of approximation. Suppose we put $\mathrm{var}\{\tilde\delta_G(x)\} =
(1/n)F(\alpha, \beta, x)$. Then

$$E_nW(\tilde\delta_G) \simeq W(\delta_G) + \frac{1}{n}\sum_x F(\alpha, \beta, x)\,p_G(x) \tag{3.7.2}$$

and such a formula could be used to obtain numerical values for
various $\alpha, \beta, n$. The results given in Table 3.4 were, however, obtained
by simulation because such a method is in any event needed for small $n$.

A general comment on Table 3.4 is that the Bayes estimator
becomes, relative to $T$, less advantageous as the ratio
$\mathrm{var}(\Lambda)/\{E\,\mathrm{var}(X\mid\Lambda)\}$ increases. In the Poisson-gamma case this
ratio is $1/\alpha$. Of course, the same is true of EBEs. We may also note that
the combination of $G_3$ and MM seems to provide a notably poor EBE
for $n = 10$. Finally, allowing for the rather small $n = 10$ in Table 3.4,
the order of magnitude of the ratios $[E_{10}\{W(\mathrm{EB})\} - W(\delta_G)]/
[E_{50}\{W(\mathrm{EB})\} - W(\delta_G)]$ seems acceptably close to 5, as would be
expected according to approximations such as given in (3.7.2).
Table 3.4 Estimates of $E_nW(\cdot)$ for the following EB estimators in the Poisson-gamma($\alpha, \beta$) case:

(a) Parametric G, moments
(b) Parametric G, ML
(c) Finite approximation, k = 3, moments
(c) Finite approximation, k = 3, ML
(g) Direct smoothing, non-decreasing ordinates

                         alpha = 2       alpha = 2        alpha = 5       alpha = 5
                         beta = 10       beta = 2         beta = 25       beta = 5
          W(delta_G)     1.67            0.33             0.83            0.17
          W(T)           5               1                5               1

  n = 50  (a)            1.86 ± 0.02     0.380 ± 0.005    0.99 ± 0.02     0.204 ± 0.004
          (b)            1.86 ± 0.02     0.377 ± 0.005    0.99 ± 0.02     0.205 ± 0.004
          (c) moments    1.99 ± 0.04     0.427 ± 0.011    1.02 ± 0.02     0.213 ± 0.004
          (c) ML         2.00 ± 0.04     0.399 ± 0.005    1.02 ± 0.02     0.211 ± 0.004
          (g)            2.15 ± 0.05     0.422 ± 0.013    1.11 ± 0.05     0.218 ± 0.008

  n = 10  (a)            2.47 ± 0.09     0.53 ± 0.03      1.33 ± 0.07     0.31 ± 0.02
          (b)            2.46 ± 0.09     0.53 ± 0.03      1.35 ± 0.07     0.31 ± 0.02
          (c) moments    7.11 ± 0.25     0.85 ± 0.03      5.02 ± 0.19     0.57 ± 0.02
          (c) ML         2.81 ± 0.12     0.65 ± 0.04      1.64 ± 0.12     0.37 ± 0.02

3.7.2 The binomial distribution 
We consider estimation of $\lambda$ in

$$p(x\mid\lambda) = \binom{m}{x}\lambda^x(1 - \lambda)^{m-x}$$

for fixed $m$.

(a) Parametric $G$, known to be beta($p, q$): we have

$$\delta_G(x) = (p + x)/(m + p + q)$$

and expressions for $E(X_G^r)$, $r = 1, 2$, are easily derived. Estimates of $p$
and $q$ by the MM can be obtained from

$$p = \mu_1'(\mu_2' - m\mu_1')/\{\mu_1'^2(m - 1) + m(\mu_1' - \mu_2')\}$$
$$q = (m - \mu_1')(\mu_2' - m\mu_1')/\{\mu_1'^2(m - 1) + m(\mu_1' - \mu_2')\} \tag{3.7.3}$$

where $\mu_r' = E(X_G^r)$, $r = 1, 2$. Substituting sample moments for $\mu_1'$ and
$\mu_2'$ gives estimates of $p$ and $q$; note that these estimates have to be
truncated away from zero since $p, q > 0$ for a proper prior $G$. The
estimates $\tilde p, \tilde q$ of $p$ and $q$ derived from (3.7.3) are clearly consistent
and $\tilde\delta_G(x) = (\tilde p + x)/(m + \tilde p + \tilde q)$ is a consistent estimate of $\delta_G(x)$.
Other methods of estimating $p, q$ can be used, like ML, with the
above mentioned truncation of the estimates.
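The moment equations (3.7.3) can be checked by a round trip: compute the first two raw moments of $X_G$ from known $p$ and $q$, then recover $p$ and $q$. The sketch below does this; all function names are hypothetical, and the truncation of the estimates away from zero mentioned above is deliberately left out, so the functions should only be fed moments for which the solutions are positive.

```python
def beta_binomial_raw_moments(m, p, q):
    """First two raw moments of X_G when X | lambda ~ binomial(m, lambda), lambda ~ beta(p, q)."""
    s = p + q
    mu1 = m * p / s
    mu2 = mu1 + m * (m - 1) * p * (p + 1) / (s * (s + 1))
    return mu1, mu2

def beta_moment_estimates(m, mu1, mu2):
    """Solve (3.7.3) for p and q given the raw moments mu1, mu2 (no truncation applied)."""
    denom = mu1 ** 2 * (m - 1) + m * (mu1 - mu2)
    p = mu1 * (mu2 - m * mu1) / denom
    q = (m - mu1) * (mu2 - m * mu1) / denom
    return p, q

def binomial_ebe(x, m, p, q):
    """Bayes estimate (p + x)/(m + p + q); an EBE when p, q are replaced by estimates."""
    return (p + x) / (m + p + q)

mu1, mu2 = beta_binomial_raw_moments(10, 3.0, 18.0)
p_rec, q_rec = beta_moment_estimates(10, mu1, mu2)
```

In an application the sample moments of the past $x$-values would take the place of `mu1` and `mu2`.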

(b) Simple EB estimation:

(i) Simple EB estimation of the odds ratio $\rho = \lambda/(1 - \lambda)$ is straightforward, for

$$E\{\lambda/(1 - \lambda)\mid x\} = \int \binom{m}{x}\lambda^{x+1}(1 - \lambda)^{m-x-1}\,dG(\lambda)\Big/p_G(x) = \frac{x + 1}{m - x}\,p_G(x + 1)/p_G(x). \tag{3.7.4}$$

This leads to the simple EBE of $\rho$ according to (3.4.3). If an estimate of
$\lambda$, rather than $\rho$, is required, an approximation in the style of (3.4.11)
could be used. It will be noted that estimation of the posterior
variance of $\rho$ is also straightforward since $E\{\lambda^2/(1 - \lambda)^2\mid x\} =
[(x + 2)(x + 1)/\{(m - x)(m - x - 1)\}]\,p_G(x + 2)/p_G(x)$.

(ii) Although it is not possible to obtain a simple EBE of $\lambda$ directly
in the form (3.4.3) we can write the Bayes estimate $\delta_G(x) = \delta_m(G, \lambda; x)$
as

$$\delta_m(G, \lambda; x) = \left(\frac{x + 1}{m + 1}\right)\frac{p_{G,m+1}(x + 1)}{p_{G,m}(x)},$$

where $p_{G,m}(x) = \binom{m}{x}\int \lambda^x(1 - \lambda)^{m-x}\,dG(\lambda)$. To use this formula for
EBE construction one would need past observations via binomial
$(m + 1, \lambda)$ variates. Alternatively one could write

$$\delta_{m-1}(G, \lambda; x) = \frac{x + 1}{m}\,p_{G,m}(x + 1)/p_{G,m-1}(x).$$

For EBE construction one could take $x$ to be the number of successes
in the first $m - 1$ of the current trials; see Robbins (1955). Elaboration
of this scheme is possible. For example, one could take all permutations
of the current trials, produce an $x$-value for each of them
and take the mean of the resulting simple EBEs.

(c) Linear empirical Bayes estimation: referring to formulae
(3.6.1) and (1.12.2) we note that if we put $U_1(X) = X/m$, $U_2(X) =
X(X - 1)/\{(m - 1)m\}$ then

$$E\{U_1(X)\mid\lambda\} = \lambda, \qquad E\{U_2(X)\mid\lambda\} = \lambda E(X\mid\lambda).$$

Therefore we can estimate the r.h.s. elements of (1.12.1) by
$(1/m)\sum_{i=1}^{n} x_i/n$ and $(1/m)\sum_{i=1}^{n} x_i(x_i - 1)/n$ respectively, giving $(w_0, w_1)$
as the solution of

lx w 0 _ 1 x 
X ?JLw'iJ m\J? 



where $\overline{x^r} = \sum_{i=1}^{n} x_i^r/n$.

(d) Approximation of G by G k : the method of obtaining an 
approximate EBE via approximation of G by a step-function G k is 
essentially the same as outlined for the Poisson case. 

The results of some studies of the performance of EBEs are 
summarized in Table 3.5. The results were obtained with beta(p, q) 
prior distributions and the method of moments was used in the 
estimation of parameters of the parametric G distribution and of G k . 
In the parametric G case the correct form of prior, i.e. beta, was 
assumed. 


3.7.3 The normal distribution 

We take the $X$ data distribution for given mean $\lambda$ to be $N(\lambda, \sigma^2)$ with
$\sigma^2$ known, and therefore without loss of generality $\sigma^2 = 1$.


Table 3.5 Binomial data distribution, beta($p, q$) prior, $E_nW(\cdot)$ values obtained
by simulation, using 20 trials. The largest estimated coefficient of variation of
any $E_nW$ for $n = 10$ was at $m = 5$, $p = 10$, $q = 9$, the estimated coefficient of
variation being 12%. For $n = 50$ the largest estimated c.v. was 3.7% at $m = 5$,
$p = 3$, $q = 18$.

                                          E_nW(.)                P_n(.)
   m    p    q    W(T)     W(delta_G)   param. G    G_k      param. G   G_k

  n = 10
  10   10    9   0.0247    0.0082       0.0126     0.0141     0.05      0.05
  10    3   18   0.0117    0.0038       0.0067     0.0073     0.05      0.05
  25   10    9   0.0094    0.0054       0.0073     0.0084     0.15      0.15
  25    3   18   0.0047    0.0025       0.0043     0.0049     0.40      0.55
   5   10    9   0.0474    0.0099       0.0184     0.0186     0.05      0.05
   5    3   18   0.0234    0.0045       0.0069     0.0078     0.00      0.00

  n = 50
  10   10    9   0.0237    0.0082       0.0091     0.0094     0.00      0.00
  10    3   18   0.0117    0.0038       0.0042     0.0046     0.00      0.00
  25   10    9   0.0095    0.0054       0.0058     0.0069     0.00      0.05
  25    3   18   0.0047    0.0025       0.0026     0.0031     0.00      0.00
   5   10    9   0.0474    0.0099       0.0119     0.0121     0.00      0.00
   5    3   18   0.0234    0.0045       0.0055     0.0058     0.00      0.00
(a) Parametric $G$ known to be $N(\mu_G, \sigma_G^2)$: the Bayes estimate is

$$\delta_G(x) = (x + \mu_G/\sigma_G^2)/(1 + 1/\sigma_G^2)$$

and the marginal $X$-distribution $F_G(x)$ is $N(\mu_G, 1 + \sigma_G^2)$. Hence
obvious estimates of $\mu_G$ and $\sigma_G^2$ are $\bar x$ and $\max(0, s^2 - 1)$, where $\bar x$ and
$s^2$ are the sample mean and variance calculated from the past
$x$-observations. More details for this case have already been given in
Example 3.5.2.

(b) Approximation of $G$ by $G_k$: some information on the goodness
of such approximations, judged by values of $W(\psi_k)$, is given in
section 3.3. Estimation of $\lambda_j$, $j = 1, \ldots, k$, can be by ML or by the
method of moments, or otherwise. Implementation of the ML method
is as indicated for the Poisson case; the method of moments requires
expressions for the first $k$ moments of $F_{G_k}$ in terms of the $\lambda$'s. For
example, with $k = 3$ these moments are $\alpha_{31} = (\lambda_1 + \lambda_2 + \lambda_3)/3$,
$\alpha_{32} = (\lambda_1^2 + \lambda_2^2 + \lambda_3^2)/3 + 1$, $\alpha_{33} = (\lambda_1^3 + \lambda_2^3 + \lambda_3^3)/3 + (\lambda_1 + \lambda_2 + \lambda_3)$.
In practice estimates of $\lambda_1, \lambda_2, \lambda_3$ can be obtained by equating $\alpha_{31}$
and $\bar x$ and minimizing distances between $\alpha_{3r}$ and $\overline{x^r}$, $r = 2, 3$.

(c) Linear empirical Bayes estimation: let $U_1(X) = X$ and
$U_2(X) = X^2 - 1$; then $E\{U_1(X)\mid\lambda\} = \lambda$, $E\{U_2(X)\mid\lambda\} = \lambda^2 = \lambda E(X\mid\lambda)$.
Thus the r.h.s. elements of (1.12.1) can be estimated by $\bar x$ and
$\overline{x^2} - 1$ respectively, giving

$$\delta(w_0, w_1; x) = \bar x/s^2 + x(s^2 - 1)/s^2.$$

Table 3.6 Data distribution $N(\lambda, 1)$, prior distribution $N(0, \sigma_G^2)$, values of
$W$(Bayes), $W$(MLE), $E_nW$(EB) for various EB estimators, and values of $n$.
All $E_nW$ values were obtained by simulation; the largest s.e.'s of the tabulated
values are for $E_{20}W$(simple EBE d(ii) smoothed), 0.03, 0.04, 0.04. All other
estimated $E_nW$ values are subject to s.e.'s $\le 0.02$

                                              sigma_G^2 = 0.1   sigma_G^2 = 0.5   sigma_G^2 = 1.0
  W(Bayes)                                         0.09              0.33              0.50
  W(MLE)                                           1.00              1.00              1.00
  E_10  W(Parametric EBE)                          0.21              0.49              0.67
  E_20  W(Parametric EBE)                          0.16              0.42              0.60
  E_10  W(G_3 approx. EBE)                         0.25              0.55              0.82
  E_20  W(G_3 approx. EBE)                         0.16              0.47              0.69
  E_20  W(simple EBE d(ii) smoothed)               0.29              0.74              1.01
  E_100 W(simple EBE d(ii) smoothed)               0.17              0.45              0.65
  E_100 W(simple EBE d(i) smoothed)                0.21              0.47              0.65
(d) Simple EB estimation: At least two approaches are possible.

(i) As in Example 1.3.7 we can write, noting $\sigma^2 = 1$,

$$\delta_G(x) = x + f_G'(x)/f_G(x). \tag{3.7.5}$$

Implementation of (3.7.5) for EB estimation requires estimation
of $f_G(x)$ and $f_G'(x)$, a topic discussed in section 3.4.6.

(ii) Following the suggestions of section 3.4.4, one can put $T_1F(x\mid\lambda)
= \int_{-\infty}^{x} u\,dF(u\mid\lambda)$, $T_2F(x\mid\lambda) = F(x\mid\lambda)$, leading to a fairly straightforward
EBE obtained from the approximation (3.4.11). Details
can be found in Maritz and Lwin (1975).

In Table 3.6 numerical results for smoothed versions of the simple
EBEs of types (i) and (ii) are shown. The smoothing method is also
described in Maritz and Lwin (1975).

The results of numerical case studies using the EB methods (a)-(d)
are also shown in Table 3.6. They indicate that EB estimates can be
considerably better than MLEs even with relatively small values of $n$.

3.8 EB estimation with multiple current observations: 
one parameter 

3.8.1 Introduction and general considerations 

For realistic applications of the EB approach the sampling scheme of
one current observation and one past $x_i$ at each $\lambda_i$ seems severely
restrictive. Let us now suppose that $m \ge 1$ independent observations
$x_1, x_2, \ldots, x_m$ on $X$ are made at the current value $\lambda$ of $\Lambda$. Also assume
that $m_i$ independent $x$-observations $x_{i1}, x_{i2}, \ldots, x_{im_i}$ are made at the
past realization $\lambda_i$ of $\Lambda$, $i = 1, 2, \ldots, n$. One of the immediate consequences
of this more general sampling scheme is that it enables one to
deal with nuisance parameters. For example, in the case where
$X \stackrel{d}{=} N(\lambda, \sigma^2)$ we need not assume $\sigma^2$ known, because it can be
estimated using the multiple observations at each $\lambda_i$. At the same time
derivation of EBEs can become more complicated, especially if the $m_i$
values differ from each other.

When the $m_i$ values do not vary, in particular, if $m_i = m$ for all $i$, it is
sometimes possible to reduce the problem to the single-observation
case. This happens when a one-dimensional sufficient statistic $T(\mathbf{x})$ for
$\lambda$ exists, in which case one can simply replace the current observations
by the value $t$ of the sufficient statistic, and similarly replace each set
of past observations by the corresponding $t_i$, $i = 1, 2, \ldots, n$.
If reduction by sufficiency is not possible a sub-optimal pseudo-Bayes
approach may be adopted whereby the current observations
are replaced by an estimate $\hat\lambda$, such as the MLE, with distribution
$F(\hat\lambda\mid\lambda, m)$. Then calculate the estimate

$$\delta_{G,m}(\hat\lambda) = \frac{\int \lambda\,dF(\hat\lambda\mid\lambda, m)\,dG(\lambda)}{\int dF(\hat\lambda\mid\lambda, m)\,dG(\lambda)}. \tag{3.8.1}$$

This estimate is not the Bayes estimate, hence $W(\delta_{G,m}(\hat\lambda)) \ge W(\text{Bayes})$.
In general, calculation of the Bayes and pseudo-Bayes estimates, and
of $W$(Bayes) and $W$(pseudo-Bayes), is rather complicated. Consequently
it seems impossible to make an accurate statement of the loss in
overall efficiency resulting from such a non-sufficient reduction. The
following example gives an indication of the quantitative effects of
non-sufficient reduction.


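Example 3.8.1 Suppose that $X \overset{d}{=} N(\lambda, 1)$, $\Lambda \overset{d}{=} N(\mu_G, \sigma_G^2)$ and that
$\hat\lambda = \text{median}(x_1, x_2, \ldots, x_m)$. Then, for $m$ reasonably large, $\hat\lambda$ is approximately
distributed as $N(\lambda, \pi/2m)$. The pseudo-Bayes estimate based on $\hat\lambda$ is

\[
\delta_{G,m}(\hat\lambda) = \frac{2m\hat\lambda/\pi + \mu_G/\sigma_G^2}{2m/\pi + 1/\sigma_G^2}
\]

and $W(\text{pseudo-Bayes}) \approx 1/(2m/\pi + 1/\sigma_G^2)$. In this case the Bayes
estimate is

\[
\delta_G(x) = \frac{m\bar{x} + \mu_G/\sigma_G^2}{m + 1/\sigma_G^2}
\]

with $W(\text{Bayes}) = 1/(m + 1/\sigma_G^2)$. As we have remarked before, see for
example section 3.1, we are really only concerned with cases where $\sigma_G^2$
is sufficiently small for EB methods to be potentially useful. So, for
example, if $\sigma_G^2 = 1/m$ we see that $W(\text{pseudo-Bayes})/W(\text{Bayes}) \approx 1.22$.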
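
The ratio quoted in Example 3.8.1 can be checked numerically. The following is a minimal sketch using only the two expected-loss formulas given in the example; the function names and the illustrative value of m are assumptions made here, not part of the original text.

```python
import math

def loss_bayes(m, var_g):
    """W(Bayes) = 1/(m + 1/sigma_G^2) for the mean of m N(lambda, 1) observations."""
    return 1.0 / (m + 1.0 / var_g)

def loss_pseudo_bayes_median(m, var_g):
    """Large-m approximation to W(pseudo-Bayes) when the data are reduced to the median."""
    return 1.0 / (2.0 * m / math.pi + 1.0 / var_g)

m = 25  # illustrative only; with sigma_G^2 = 1/m the ratio does not depend on m
ratio = loss_pseudo_bayes_median(m, 1.0 / m) / loss_bayes(m, 1.0 / m)
print(round(ratio, 2))  # 1.22
```

With $\sigma_G^2 = 1/m$ the ratio reduces to $2/(1 + 2/\pi)$, which is why the choice of m above is immaterial.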


If the data reduction is done by the ML method we have
asymptotic equivalence of the pseudo-Bayes and Bayes estimates
through the asymptotic sufficiency of the MLE as $m \to \infty$. For a
discussion of asymptotic efficiency, see for example Cox and Hinkley
(1974, p. 307). Asymptotic sufficiency of the MLE and asymptotic
normality of its distribution can be useful in evaluating $W(\cdot)$. In what
follows we shall usually assume that reduction by the MLE is done.

Multiple current observations also enable one to deal with
nuisance parameters. Perhaps the most obvious case is that of the $X$-distribution
being $N(\lambda, \sigma^2)$ with $\sigma^2$ unknown but fixed, while
$\Lambda \overset{d}{=} N(\mu_G, \sigma_G^2)$; in other words the prior distribution of $\sigma^2$ is taken
to be degenerate. In this case $\sigma^2$ can be estimated by

\[
\hat\sigma^2 = \sum_{i=1}^{n} \sum_{j=1}^{m_i} (x_{ij} - \bar{x}_{i\cdot})^2 \Big/ \sum_{i=1}^{n} (m_i - 1),
\]

the usual within-group estimator. In formulae to do with the EBE of $\lambda$ the parameter $\sigma^2$ can then
be replaced by $\hat\sigma^2$, appropriate allowance being made for this
estimation of $\sigma^2$ in the calculation of $E_n W(\text{EB})$. Again, if all $m_i$ are
equal, estimation of $\mu_G$ and $\sigma_G^2$ is straightforward.
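
As an illustration, the within-group estimator of $\sigma^2$ described above can be sketched as follows. The function name and the small data set are hypothetical and serve only to show the arithmetic.

```python
def pooled_within_variance(groups):
    """Within-group estimate of sigma^2 from samples of unequal sizes m_i."""
    numerator = 0.0
    degrees_of_freedom = 0
    for obs in groups:
        mean_i = sum(obs) / len(obs)
        numerator += sum((x - mean_i) ** 2 for x in obs)
        degrees_of_freedom += len(obs) - 1
    if degrees_of_freedom == 0:
        raise ValueError("at least one group needs more than one observation")
    return numerator / degrees_of_freedom

# Hypothetical data: three past realizations with m_i = 2, 3, 1.
groups = [[1.0, 3.0], [2.0, 4.0, 6.0], [5.0]]
sigma2_hat = pooled_within_variance(groups)
```

Groups with a single observation contribute nothing to either sum, so they can be included without special handling.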

In the case of multiple parameter estimation we shall also adopt the 
policy of reducing the observations to the MLEs. Both in the one- 
parameter and the multiple-parameter cases estimation of the prior 
distribution and construction of simple EBEs remains a non-trivial 
exercise. 


3.8.2 Unequal $m_i$: simple EB estimation

The case studies of simple and other EBEs that have been reported 
indicate that the performance of the simple EBEs is relatively poor. 
Therefore, although many possibilities exist for constructing simple
EBEs when $m_i \ge 1$ with unequal $m_i$, only some of these will be discussed.
Generally they lead to rather unwieldy calculations.

1. $m \le$ all $m_i$: let the current observations be summarized in the MLE
$\hat\lambda$ of $\lambda$. Select $m$ observations from each past set and calculate
$\hat\lambda_1, \ldots, \hat\lambda_n$. Then use one of the single observation methods of
constructing a simple EBE. This procedure can, of course, be
followed using every possible subset of size $m$ of the observations at
each past $\lambda_i$. Averaging the results is one way of obtaining a single
EBE using all of the past observations. There are $\prod_{i=1}^{n} \binom{m_i}{m}$
possible estimates, hence the calculations could be tedious.

2. $m \ge$ all $m_i$: a compromise method seems to be the only practical
possibility, namely, instead of trying to estimate the Bayes estimate
based on the $m$ current observations, obtain EBEs based on
subsets of the current observations, and average them. In order to
proceed in a manner similar to that suggested in 1, the smallest
subsets would have to be of size $\min(m_1, m_2, \ldots, m_n)$. Of course the
average of Bayes estimates from subsets of $m$ observations is not
the Bayes estimate itself, and the loss in efficiency may be great.
This can be illustrated quite simply by considering the $N(\lambda, 1)$,
$N(0, 1)$ case where the Bayes estimate is

\[
\delta_G(x) = \frac{m\bar{x}}{m + 1}
\]

and the average of $m$ single observation Bayes estimates is
$\bar\delta_G(x) = \bar{x}/2$. The respective expected losses are

\[
W(\delta_G) = \frac{1}{m + 1} \to 0 \quad \text{as } m \to \infty,
\]

\[
W(\bar\delta_G) = \frac{1}{4}\left(1 + \frac{1}{m}\right) \to \frac{1}{4} \quad \text{as } m \to \infty.
\]


In the light of the difficulties mentioned under 1 and 2 above we 
shall concentrate on EBEs depending on estimation of G or an 
approximation to G. 
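
The two difficulties noted under 1 and 2 can be made concrete with a short sketch. It counts the subset-based estimates of case 1 and evaluates the two expected losses of case 2 for the $N(\lambda, 1)$, $N(0, 1)$ illustration; the closed form used for the averaged estimate, $E(\bar{x}/2 - \Lambda)^2 = (1 + 1/m)/4$, follows from direct calculation. Function names are assumptions made here.

```python
import math

def count_subset_estimates(past_sizes, m):
    """Number of ways of choosing m observations from every past set (case 1)."""
    return math.prod(math.comb(m_i, m) for m_i in past_sizes)

def loss_full_bayes(m):
    """W for delta_G = m * xbar / (m + 1): N(lambda, 1) data, N(0, 1) prior."""
    return 1.0 / (m + 1)

def loss_averaged_single(m):
    """W for xbar / 2, the average of m single-observation Bayes estimates."""
    return 0.25 * (1.0 + 1.0 / m)

for m in (1, 5, 50):
    print(m, loss_full_bayes(m), loss_averaged_single(m))
```

For m = 1 the two losses coincide, as they must; for larger m the averaged estimate stays above 1/4 while the Bayes loss tends to zero.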


3.8.3 Unequal $m_i$: parametric G of known form, and approximation
by $G_k$, using sufficient statistics


Suppose that the estimate $t(x_1, \ldots, x_m)$ is a sufficient statistic, and that
$t_i(x_{i1}, \ldots, x_{im_i})$ are the corresponding sufficient statistics calculated from
the past samples. The distribution of $t_i$ depends on $\lambda_i$ and on $m_i$, and its
p.d.f. is $p(t_i \mid \lambda_i, m_i)$. The prior distribution $G$ is of known parametric
form $G(\lambda; \theta)$, depending on parameters $\theta$; the dimension of $\theta$ is $k$.

The likelihood of the observations $t_1, t_2, \ldots, t_n$ is

\[
L(t; \theta) = \prod_{i=1}^{n} \int p(t_i \mid \lambda_i, m_i)\, dG(\lambda_i; \theta) \tag{3.8.2}
\]

and estimation of the parameters $\theta$ can now be done by maximizing
$L(t; \theta)$ w.r.t. $\theta$.

In some cases it may be relatively easy to use the method of
moments for estimating the parameters $\theta$. Suppose that $E(T_i^r \mid \lambda_i) =
u_r(\lambda_i, m_i)$, $r = 1, 2, \ldots, k$. Then

\[
E\left(\sum_{i=1}^{n} T_i^r\right) = \sum_{i=1}^{n} \int u_r(\lambda; m_i)\, dG(\lambda; \theta), \qquad r = 1, 2, \ldots, k, \tag{3.8.3}
\]

and for certain distributions (3.8.3) has a fairly tractable form.
Replacing the l.h.s. of (3.8.3) by observed sums and solving the
equations gives MM estimates of $\theta$.

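Example 3.8.2 Let $X \overset{d}{=} \text{Poisson}(\lambda)$ and put $T_i = \sum_{j=1}^{m_i} x_{ij}/m_i$.
Then $E(T_i \mid \lambda_i) = \lambda_i$, $E(T_i^2 \mid \lambda_i) = \lambda_i^2 + \lambda_i/m_i$. If the prior
distribution of $\Lambda$ is gamma$(\alpha, \beta)$ with p.d.f. $\{1/\Gamma(\beta)\}\, \alpha^{\beta} \lambda^{\beta - 1} e^{-\alpha\lambda}$, $\lambda > 0$,
we have

\[
E\left(\sum_{i=1}^{n} T_i\right) = n\beta/\alpha, \qquad
E\left(\sum_{i=1}^{n} T_i^2\right) = n\beta(\beta + 1)/\alpha^2 + (\beta/\alpha) \sum_{i=1}^{n} 1/m_i.
\]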
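
For the Poisson case with a gamma prior the two moment equations can be solved in closed form. The sketch below assumes the parametrization of Example 3.8.2 (prior mean $\beta/\alpha$, prior variance $\beta/\alpha^2$); the function name and the data are hypothetical.

```python
def mm_gamma_prior(t, m):
    """Method-of-moments estimates (alpha, beta) from past means t_i and sizes m_i."""
    n = len(t)
    mean_t = sum(t) / n
    mean_t2 = sum(x * x for x in t) / n
    mean_inv_m = sum(1.0 / mi for mi in m) / n
    # Estimated prior variance beta / alpha^2 implied by the two moment equations.
    prior_var = mean_t2 - mean_t ** 2 - mean_t * mean_inv_m
    if prior_var <= 0:
        raise ValueError("moment equations have no admissible solution for these data")
    alpha = mean_t / prior_var
    beta = mean_t * alpha
    return alpha, beta

# Hypothetical past data: mean counts t_i from m_i Poisson observations each.
t = [1.0, 2.5, 4.0, 0.5, 3.0]
m = [2, 4, 1, 2, 5]
alpha_mm, beta_mm = mm_gamma_prior(t, m)
```

When the estimated prior variance is not positive the equations have no solution with $\alpha, \beta > 0$; raising an error in that case is a choice made for this sketch, not a recommendation from the text.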

When the form of $G$ is not given the approach of approximating $G$
by $G_k$, a step-function, can be taken. In (3.8.2) and (3.8.3) $G(\lambda; \theta)$ is
replaced by $G_k(\lambda; \theta_1, \theta_2, \ldots, \theta_k)$ where $\theta_1, \theta_2, \ldots, \theta_k$ are the points at
which the function has jumps of size $1/k$.


3.8.4 Linear EB using sufficient statistics 

In certain relatively simple cases the linear Bayes method of 
section 1.12 can be adapted for the present situation. Suppose that 

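\[
E(T \mid \lambda) = \lambda, \qquad E(T^2 \mid \lambda) = \lambda^2 + C\lambda/m,
\]

where $C$ is a constant. Then (1.12.2) can be written

\[
\begin{bmatrix} 1 & E(\Lambda) \\ E(\Lambda) & E(\Lambda^2) + C E(\Lambda)/m \end{bmatrix}
\begin{bmatrix} w_0 \\ w_1 \end{bmatrix}
=
\begin{bmatrix} E(\Lambda) \\ E(\Lambda^2) \end{bmatrix}. \tag{3.8.4}
\]

From the past observations estimates of $E(\Lambda)$ and $E(\Lambda^2)$ can be
obtained by noting that

\[
\bar{t} = \frac{1}{n} \sum_{i=1}^{n} t_i \quad \text{is an estimate of } E(\Lambda);
\]

\[
\overline{t^2} = \frac{1}{n} \sum_{i=1}^{n} t_i^2 \quad \text{is an estimate of } E(\Lambda^2) + E(\Lambda)(C/n) \sum_{i=1}^{n} 1/m_i.
\]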


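Example 3.8.3 Let $X \overset{d}{=} \text{Poisson}(\lambda)$ and put $t_i = \sum_{j=1}^{m_i} x_{ij}/m_i$. Then
$E(T_i \mid \lambda_i) = \lambda_i$, $E(T_i^2 \mid \lambda_i) = \lambda_i^2 + \lambda_i/m_i$, $i = 1, 2, \ldots, n$. Writing $\gamma_r = E(\Lambda^r)$,
$r = 1, 2$, estimates of $\gamma_1$, $\gamma_2$ are given by

\[
\hat\gamma_1 = \bar{t}, \qquad
\overline{t^2} = \hat\gamma_2 + \hat\gamma_1 \frac{1}{n} \sum_{i=1}^{n} (1/m_i),
\]

which may be compared with the results in Example 3.8.2.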
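
The linear EB calculation of Example 3.8.3 can be sketched as below: the moments are estimated from the past data and the two equations in (3.8.4) are solved for the weights. The function name, the truncation of a negative variance estimate at zero, and the data are assumptions introduced here for illustration.

```python
def linear_eb_weights(t, m_past, m_current, c=1.0):
    """Estimate gamma_1, gamma_2 and solve (3.8.4) for w0, w1 (c = 1 for Poisson)."""
    n = len(t)
    g1 = sum(t) / n
    g2 = sum(x * x for x in t) / n - c * g1 * sum(1.0 / mi for mi in m_past) / n
    # Estimated prior variance; truncation at zero is a safeguard added in this sketch.
    spread = max(0.0, g2 - g1 ** 2)
    w1 = spread / (spread + c * g1 / m_current)
    w0 = g1 * (1.0 - w1)
    return w0, w1, g1, g2

t = [1.0, 2.5, 4.0, 0.5, 3.0]   # hypothetical past means
m_past = [2, 4, 1, 2, 5]
w0, w1, g1, g2 = linear_eb_weights(t, m_past, m_current=3)
estimate = w0 + w1 * 2.0         # linear EB estimate for a current mean of 2.0
```

The estimate is a weighted compromise between the overall past mean and the current mean, with more weight on the current mean when the current sample size is large.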

Somewhat more generally we may have $E(T \mid \lambda) = \lambda$, but
$E(T^2 \mid \lambda) = \lambda^2 + q(\lambda)\nu$, where $\nu$ is known, as is every $\nu_i$ corresponding
to the $i$th component. An obvious special case is $\nu_i = 1/m_i$.
Now $\overline{t^2}$ is an estimate of $E(\Lambda^2) + E\{q(\Lambda)\}(1/n)\sum_{i=1}^{n} \nu_i$. We need
an estimate of $E(\Lambda^2)$, consequently we need to find $U_2(X)$ such that
$E(U_2(X) \mid \lambda) = \lambda^2$ or $U_q(X)$ such that $E(U_q(X) \mid \lambda) = q(\lambda)$.

These may be non-trivial problems, but in many important cases
the form of $q(\lambda)$ is such as to cause little difficulty. In particular, we
may have $q(\lambda)$ not dependent on $\lambda$, as when the data distribution is
$N(\lambda, \sigma^2)$. If $\sigma^2$ is known application of the linear EB method is
straightforward in this case. Another interesting application arises in
the design of a quality measurement plan as described in section 8.3.6.
Here $T_i = I_i$, where the $I_i$ are quality indices, the model being that
$E(I_i \mid \lambda_i) = \lambda_i$ and $\text{var}(I_i \mid \lambda_i) = \lambda_i/e_i$, where the $e_i$ are known constants.

3.8.5 Unequal $m_i$: reduction by ML estimation

In the light of our earlier discussion on the use of MLEs, section 3.8.1, 
the methods to be employed here are essentially the same as those for 
the sufficient statistics. Typically difficulties will arise because the 
distributions of the MLEs will not be easily written down when they 
are not sufficient statistics. When the $m_i$ are large enough, approximate
normality of the MLEs could be used.


3.8.6 Unequal $m_i$ generally: parametric G and $G_k$ approximations


If $G$ has the known parametric form $G(\lambda; \theta)$ it is, in principle, possible
to estimate the parameters of $G$ by the method of maximum
likelihood, using the more general version of (3.8.2):

\[
L(x; \theta) = \prod_{i=1}^{n} \int \Big\{ \prod_{j=1}^{m_i} f(x_{ij} \mid \lambda_i) \Big\}\, dG(\lambda_i; \theta). \tag{3.8.5}
\]

Apart from a multiplicative factor not dependent on $\lambda_1, \lambda_2, \ldots, \lambda_n$, the
likelihood in (3.8.5) reduces to the likelihood in (3.8.2) when $t$ is a
sufficient statistic. Finding MLEs of $\theta_1, \theta_2, \ldots, \theta_k$ from (3.8.5) is a
straightforward computational problem. As usual however, calculation
of Bayes and EB estimates, while straightforward, can be
computationally difficult since multiple integration is required.

If the parametric form of $G$ is not known and $G_k$ is used to
approximate $G$, all calculations are as outlined above with $G$ replaced
by $G_k$. In this case the EM algorithm can be particularly useful.


3.9 Unequal $m_i \ge 1$: application to particular distributions

3.9.1 The Poisson distribution 

The statistic $T = (X_1 + \cdots + X_m)/m$ is sufficient for $\lambda$ and since the
distribution of $X_1 + \cdots + X_m$ is Poisson$(m\lambda)$ we have

\[
E(T \mid \lambda) = \lambda, \qquad E(T^2 \mid \lambda) = \lambda^2 + \lambda/m;
\]

see also Examples 3.8.2 and 3.8.3.


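(a) Parametric G: $\Lambda \overset{d}{=}$ gamma$(\alpha, \beta)$: ML estimation of $\alpha$, $\beta$
In this example it is convenient to use $T^* = mT$ in the ML estimation
because we have

\[
P(T^* = r) = \int_0^{\infty} \frac{\alpha^{\beta} \lambda^{\beta - 1}}{\Gamma(\beta)} \frac{m^r \lambda^r}{r!} e^{-(\alpha + m)\lambda}\, d\lambda
= \frac{m^r \alpha^{\beta}\, \Gamma(\beta + r)}{(\alpha + m)^{\beta + r}\, \Gamma(\beta)\, r!} \tag{3.9.1}
\]

and $L(T; \theta)$ in (3.8.2) becomes

\[
L(T^*; \theta) = \prod_{i=1}^{n} \frac{m_i^{t_i^*} \alpha^{\beta}\, \Gamma(\beta + t_i^*)}{(\alpha + m_i)^{\beta + t_i^*}\, \Gamma(\beta)\, t_i^*!}. \tag{3.9.2}
\]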


It should be noted that the formula (3.9.2) can also be applied when
$m_i$ is not an integer. This can occur if only the values of $t_i^*$ are reported
and they are, for example, Poisson counts over time periods of
variable lengths proportional to $m_i$. Differentiating $\ln L$ gives

\[
\frac{\partial \ln L}{\partial \alpha} = \frac{n\beta}{\alpha} - \sum_{i=1}^{n} \frac{\beta + t_i^*}{\alpha + m_i},
\]

\[
\frac{\partial \ln L}{\partial \beta} = n \ln \alpha - \sum_{i=1}^{n} \ln(\alpha + m_i) + \sum_{i=1}^{n} \Psi(\beta + t_i^*) - n\Psi(\beta), \tag{3.9.3}
\]

where $\Psi(z) = \Gamma'(z)/\Gamma(z)$, the digamma function. Setting $S_1(t^*; \alpha, \beta) =
\partial \ln L/\partial \alpha = 0$ and $S_2(t^*; \alpha, \beta) = \partial \ln L/\partial \beta = 0$ and solving the equations
gives the MLEs $\hat\alpha$, $\hat\beta$ of $\alpha$, $\beta$.
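
As a hedged illustration, the log of the likelihood (3.9.2) and the first score function in (3.9.3) can be coded directly with the standard library. The function names and data are hypothetical; the second score needs a digamma routine, which the Python standard library does not provide, so it is omitted here.

```python
import math

def log_lik(alpha, beta, t_star, m):
    """ln L(T*; alpha, beta) from (3.9.2); the m_i need not be integers."""
    total = 0.0
    for r, mi in zip(t_star, m):
        total += (r * math.log(mi) + beta * math.log(alpha)
                  + math.lgamma(beta + r)
                  - (beta + r) * math.log(alpha + mi)
                  - math.lgamma(beta) - math.lgamma(r + 1))
    return total

def score_alpha(alpha, beta, t_star, m):
    """S_1 = d ln L / d alpha, the first equation of (3.9.3)."""
    n = len(m)
    return n * beta / alpha - sum((beta + r) / (alpha + mi) for r, mi in zip(t_star, m))

t_star = [2, 10, 4, 1, 15]   # hypothetical totals t_i* = m_i * t_i
m = [2, 4, 1, 2, 5]
print(log_lik(1.3, 2.7, t_star, m), score_alpha(1.3, 2.7, t_star, m))
```

In practice the pair of score equations would be handed to a general-purpose root finder or the log-likelihood to a numerical optimizer; which routine to use is left open here.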

For the purposes of evaluating the performance of the EBEs the
covariance matrix of $\hat\alpha$, $\hat\beta$ is useful. For more details we refer to
section 3.11, but consider here the calculation of the relevant
covariance matrix. The method used here is the two-parameter
version of that described briefly in section 3.5.2.

The estimates of $\alpha$ and $\beta$ are obtained by solving the two equations
in $a$ and $b$,

\[
S_r(t^*; a, b) = 0, \qquad r = 1, 2. \tag{3.9.4}
\]

An approximate, large $n$, formula for the covariance matrix $C$ of $\hat\alpha$, $\hat\beta$ is

\[
C \approx \boldsymbol{\theta}^{-1} V (\boldsymbol{\theta}^{T})^{-1}, \tag{3.9.5}
\]

where the elements of $V$ are $\text{cov}\{S_r, S_s\}$, $r, s = 1, 2$,
and the elements of $\boldsymbol{\theta}$ are

\[
\left\{ \frac{\partial E S_r(T^*; a, b)}{\partial c} \right\}_{a = \alpha,\, b = \beta}, \qquad c = a, b.
\]


Expressions for $\text{var}(S_1)$, etc. in terms of $\alpha$, $\beta$ can be obtained using
(3.9.1) and the independence of the $T_i^*$ statistics. An interesting, and
practically possibly more realistic, approach is to allow the $m_i$ to be
independent realizations of a r.v. $M$. A considerable advantage of such
an approach is that estimates of the elements of $V$ and $\boldsymbol{\theta}$ can be
calculated directly from the observed data. This is like using the 
observed information matrix in the usual ML estimation, a procedure 
advocated by many statisticians; see for example, Cox and Hinkley 
(1974, p. 302). 

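Write $S_1(T^*; a, b) = \sum_{i=1}^{n} U_i(a, b)$ and $S_2(T^*; a, b) = \sum_{i=1}^{n} V_i(a, b)$,
where, from (3.9.3) we have

\[
U_i(\alpha, \beta) = \beta/\alpha - (\beta + T_i^*)/(\alpha + M_i),
\]

\[
V_i(\alpha, \beta) = \ln \alpha - \ln(\alpha + M_i) + \Psi(\beta + T_i^*) - \Psi(\beta).
\]

We shall simplify notation by putting $S_r(T^*; a, b) = S_r(a, b)$, $r = 1, 2$.

Now $(U_i, V_i)$, $i = 1, 2, \ldots, n$ are independent realizations of the
two-dimensional r.v. $(U, V)$. Then

\[
\hat{E} S_1(a, b) = \sum_{i=1}^{n} U_i(a, b), \qquad
\frac{\partial \hat{E} S_1(a, b)}{\partial a} = \sum_{i=1}^{n} \frac{\partial U_i(a, b)}{\partial a},
\]

and we estimate $\{\partial E S_1(a, b)/\partial a\}_{a = \alpha,\, b = \beta}$ by $\{\partial \hat{E} S_1(a, b)/\partial a\}_{a = \hat\alpha,\, b = \hat\beta}$, etc.
We can estimate $\text{var}\{S_1(\alpha, \beta)\}$, $\text{var}\{S_2(\alpha, \beta)\}$, $\text{cov}\{S_1(\alpha, \beta), S_2(\alpha, \beta)\}$
by $nS_U^2$, $nS_V^2$, $nS_{UV}$ where $S_U^2$, $S_V^2$, $S_{UV}$ are the usual sample variances and
covariance calculated using observed $U_i(\hat\alpha, \hat\beta)$, $V_i(\hat\alpha, \hat\beta)$ values. For
example, let $\bar{U}(\hat\alpha, \hat\beta) = \sum_{i=1}^{n} U_i(\hat\alpha, \hat\beta)/n$, then

\[
S_U^2 = \sum_{i=1}^{n} \{U_i(\hat\alpha, \hat\beta) - \bar{U}(\hat\alpha, \hat\beta)\}^2/(n - 1).
\]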

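(b) Parametric G: $\Lambda \overset{d}{=}$ gamma$(\alpha, \beta)$: MM estimation of $\alpha$, $\beta$
Details of this case have already been given in Example 3.8.2. The
MMEs $\tilde\alpha$, $\tilde\beta$ of $\alpha$ and $\beta$ are obtained as the solutions of

\[
\begin{aligned}
M_1(\alpha, \beta) &= \sum_{i=1}^{n} t_i - n\beta/\alpha = 0, \\
M_2(\alpha, \beta) &= \sum_{i=1}^{n} t_i^2 - n\beta(\beta + 1)/\alpha^2 - (\beta/\alpha) \sum_{i=1}^{n} (1/m_i) = 0.
\end{aligned} \tag{3.9.6}
\]

Essentially the same techniques as those used in the MLE case are
applicable here. In (3.9.5) $\hat\alpha$, $\hat\beta$ are replaced by $\tilde\alpha$, $\tilde\beta$, $S_r$ by $M_r$, $r = 1, 2$,
etc. Instead of $U_i(a, b)$, $V_i(a, b)$ we have $Y_i(a, b) = T_i - b/a$, $Z_i(a, b) =
T_i^2 - b(b + 1)/a^2 - b/(aM_i)$.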
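
The covariance approximation (3.9.5), with $V$ and $\boldsymbol{\theta}$ estimated from the observed summands, can be sketched generically for any pair of estimating functions. The code below is an illustration under stated assumptions: derivatives are taken by central differences rather than analytically, and the function `psi` used in the example is the simple pair $(T_i - \gamma_1,\; T_i^2 - \gamma_1/M_i - \gamma_2)$ chosen for brevity. All names are hypothetical.

```python
def sandwich_cov(psi, data, est, h=1e-5):
    """Estimate C = theta^{-1} V (theta^T)^{-1} for two estimating functions."""
    n = len(data)
    a, b = est
    vals = [psi(obs, a, b) for obs in data]
    mean = [sum(v[k] for v in vals) / n for k in range(2)]
    # V: n times the sample covariance matrix of the summands.
    V = [[n * sum((v[r] - mean[r]) * (v[s] - mean[s]) for v in vals) / (n - 1)
          for s in range(2)] for r in range(2)]

    def total(a_, b_):
        return [sum(psi(obs, a_, b_)[k] for obs in data) for k in range(2)]

    # theta: derivatives of the summed estimating functions (central differences).
    da = [(p - q) / (2 * h) for p, q in zip(total(a + h, b), total(a - h, b))]
    db = [(p - q) / (2 * h) for p, q in zip(total(a, b + h), total(a, b - h))]
    theta = [[da[0], db[0]], [da[1], db[1]]]
    det = theta[0][0] * theta[1][1] - theta[0][1] * theta[1][0]
    inv = [[theta[1][1] / det, -theta[0][1] / det],
           [-theta[1][0] / det, theta[0][0] / det]]
    tmp = [[sum(inv[r][k] * V[k][s] for k in range(2)) for s in range(2)]
           for r in range(2)]
    return [[sum(tmp[r][k] * inv[s][k] for k in range(2)) for s in range(2)]
            for r in range(2)]

def psi(obs, g1, g2):
    t_i, m_i = obs
    return (t_i - g1, t_i * t_i - g1 / m_i - g2)

data = list(zip([1.0, 2.5, 4.0, 0.5, 3.0], [2, 4, 1, 2, 5]))  # hypothetical (t_i, m_i)
g1 = sum(t for t, _ in data) / len(data)
g2 = sum(t * t - g1 / mi for t, mi in data) / len(data)
C = sandwich_cov(psi, data, (g1, g2))
```

For this choice of `psi` the leading element of `C` reduces to the sample variance of the $t_i$ divided by $n$, which gives a simple check on the arithmetic.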

(c) $G_k$ approximation of G

We have to replace the gamma$(\alpha, \beta)$ distribution which leads to (3.9.1)
by $G_k$ giving

\[
P(T^* = r) = \frac{1}{k} \sum_{j=1}^{k} e^{-m\lambda_j} (m\lambda_j)^r / r!. \tag{3.9.7}
\]

Unfortunately this formula cannot be simplified, consequently the
analogue of (3.8.5) becomes

\[
L_k(T^*; \lambda) = \prod_{i=1}^{n} \left\{ \frac{1}{k} \sum_{j=1}^{k} e^{-m_i\lambda_j} (m_i\lambda_j)^{t_i^*} / t_i^*! \right\}
\]

leading to somewhat unwieldy formulae for the derivatives of $\ln L_k$
w.r.t. $\lambda_j$. There is also the complication that one needs to put
$\lambda_1 \le \lambda_2 \le \cdots \le \lambda_k$ in order to make computations feasible.

The MM estimation of $\lambda_1, \ldots, \lambda_k$ is also complicated in this case
because, with $k \ge 3$ it is not necessarily possible to find $\lambda$ such that the
first $k$ moment equations are satisfied exactly. The same difficulty was
noted for $m_i = 1$.
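
One way of avoiding the unwieldy derivatives of $\ln L_k$ is an EM iteration, treating the unobserved support point of each past realization as missing data. The sketch below is a standard EM step for a Poisson mixture with fixed equal masses $1/k$; it is offered as an illustration, not as the procedure used by the authors, and the names, starting values and data are hypothetical.

```python
import math

def log_pois(r, mean):
    return r * math.log(mean) - mean - math.lgamma(r + 1)

def log_lik_gk(support, t_star, m):
    """ln L_k for equal jumps 1/k at the points in `support`."""
    k = len(support)
    total = 0.0
    for r, mi in zip(t_star, m):
        terms = [log_pois(r, mi * lam) for lam in support]
        top = max(terms)
        total += top + math.log(sum(math.exp(v - top) for v in terms)) - math.log(k)
    return total

def em_step(support, t_star, m):
    """One EM update of the support points lambda_1 <= ... <= lambda_k."""
    k = len(support)
    num = [0.0] * k
    den = [0.0] * k
    for r, mi in zip(t_star, m):
        terms = [log_pois(r, mi * lam) for lam in support]
        top = max(terms)
        w = [math.exp(v - top) for v in terms]
        s = sum(w)
        for j in range(k):
            num[j] += w[j] / s * r    # E-step weight times observed total
            den[j] += w[j] / s * mi   # E-step weight times exposure m_i
    # M-step: weighted rate estimate; assumes every point keeps positive weight.
    return sorted(num[j] / den[j] for j in range(k))

t_star = [2, 10, 4, 1, 15, 0, 3]
m = [2, 4, 1, 2, 5, 1, 3]
fitted = [0.5, 2.0, 4.0]
for _ in range(50):
    fitted = em_step(fitted, t_star, m)
```

Sorting the updated points enforces the ordering $\lambda_1 \le \cdots \le \lambda_k$ without changing the likelihood, since the masses are equal.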

(d) Linear EB 

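According to Example 3.8.3 estimates $\hat\gamma_1$, $\hat\gamma_2$ for $E(\Lambda)$, $E(\Lambda^2)$ are
obtained as

\[
\hat\gamma_1 = \bar{t}, \qquad
\hat\gamma_2 = \overline{t^2} - \bar{t}\, \frac{1}{n} \sum_{i=1}^{n} (1/m_i),
\]

and estimates $\hat{w}_0$, $\hat{w}_1$ are obtained from (3.8.4) which becomes

\[
\begin{bmatrix} 1 & \hat\gamma_1 \\ \hat\gamma_1 & \hat\gamma_2 + \hat\gamma_1/m \end{bmatrix}
\begin{bmatrix} \hat{w}_0 \\ \hat{w}_1 \end{bmatrix}
=
\begin{bmatrix} \hat\gamma_1 \\ \hat\gamma_2 \end{bmatrix}.
\]

Here also assessment of the covariance matrix of $\hat{w}_0$, $\hat{w}_1$ is relatively
straightforward if we take $m_i$ to be independent realizations of a r.v.
$M$. Note that $\hat\gamma_1$, $\hat\gamma_2$ are solutions of

\[
\begin{aligned}
M_1^*(\gamma_1, \gamma_2) &= \sum_{i=1}^{n} t_i - n\gamma_1 = 0, \\
M_2^*(\gamma_1, \gamma_2) &= \sum_{i=1}^{n} (t_i^2 - \gamma_1/m_i) - n\gamma_2 = 0,
\end{aligned}
\]

so that $U_i$, $V_i$ in section (a) above can be replaced by

\[
Y_i^*(\gamma_1, \gamma_2) = T_i - \gamma_1, \qquad
Z_i^*(\gamma_1, \gamma_2) = T_i^2 - \gamma_2 - \gamma_1/M_i.
\]

This enables one to calculate an estimate of the covariance matrix of
$\hat\gamma_1$, $\hat\gamma_2$ and hence of $\hat{w}_0$, $\hat{w}_1$.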


3.9.2 The binomial distribution 

We take $\lambda$ to be the probability of success and $X_i = 1$ or $0$ according
as a success or a failure occurs. Then the distribution of $T^* =
X_1 + \cdots + X_m$ is binomial$(m, \lambda)$, and for $T = T^*/m$ we have

\[
E(T \mid \lambda) = \lambda, \qquad E(T^2 \mid \lambda) = \lambda^2 + \lambda(1 - \lambda)/m.
\]

The statistic $T$ is sufficient for $\lambda$ hence we can adopt the approach of
section 3.8.3. In Bayes estimation of $\lambda$ the natural conjugate prior
distribution is beta$(\beta_1, \beta_2)$, and we shall consider only this distribution
for the parametric G case. As in the Poisson case this choice
and limitation are defended on the grounds that the beta family is rich
enough to cover many realistic situations. Details for other parametric
G families can be worked out following the patterns for the
beta family.


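(a) Parametric G: $\Lambda \overset{d}{=}$ beta$(\beta_1, \beta_2)$: ML estimation of $\beta_1$, $\beta_2$

The marginal $T^*$ distribution is

\[
P(T^* = r) = \binom{m}{r} \frac{\Gamma(\beta_1 + r)\, \Gamma(\beta_2 + m - r)}{\Gamma(\beta_1 + \beta_2 + m)} \cdot \frac{\Gamma(\beta_1 + \beta_2)}{\Gamma(\beta_1)\, \Gamma(\beta_2)}
\]

and the likelihood in (3.8.2) becomes

\[
L(T^*; \beta_1, \beta_2) = \prod_{i=1}^{n} \binom{m_i}{t_i^*} \frac{\Gamma(\beta_1 + t_i^*)\, \Gamma(\beta_2 + m_i - t_i^*)}{\Gamma(\beta_1 + \beta_2 + m_i)} \cdot \frac{\Gamma(\beta_1 + \beta_2)}{\Gamma(\beta_1)\, \Gamma(\beta_2)}. \tag{3.9.8}
\]

The MLEs of $\beta_1$, $\beta_2$ are the solutions $\hat\beta_1$, $\hat\beta_2$ of

\[
S_1(\beta_1, \beta_2) = \sum_{i=1}^{n} \Psi(\beta_1 + t_i^*) + n\Psi(\beta_1 + \beta_2)
- \sum_{i=1}^{n} \Psi(\beta_1 + \beta_2 + m_i) - n\Psi(\beta_1) = 0,
\]

\[
S_2(\beta_1, \beta_2) = \sum_{i=1}^{n} \Psi(\beta_2 + m_i - t_i^*) + n\Psi(\beta_1 + \beta_2)
- \sum_{i=1}^{n} \Psi(\beta_1 + \beta_2 + m_i) - n\Psi(\beta_2) = 0.
\]

Estimation of the covariance matrix follows steps like those given
for the Poisson distribution. Here too considerable simplification is
achieved if $m_1, \ldots, m_n$ are taken to be independent realizations of a r.v.
$M$. Let $U_i(\beta_1, \beta_2)$ and $V_i(\beta_1, \beta_2)$ be the elements of the sums defining
$S_1$ and $S_2$ respectively. Then the remaining calculations are like those
in section 3.9.1(a).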

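To estimate $\beta_1$, $\beta_2$ by the method of moments we have to solve

\[
\sum_{i=1}^{n} Y_i(\beta_1, \beta_2) = \sum_{i=1}^{n} \left\{ t_i - \frac{\beta_1}{\beta_1 + \beta_2} \right\} = 0,
\]

\[
\sum_{i=1}^{n} Z_i(\beta_1, \beta_2) = \sum_{i=1}^{n} \left\{ t_i^2 - \frac{\beta_1(\beta_1 + 1)(1 - 1/m_i)}{(\beta_1 + \beta_2)(\beta_1 + \beta_2 + 1)} - \frac{\beta_1}{m_i(\beta_1 + \beta_2)} \right\} = 0.
\]

Calculations of estimates of the elements of the covariance matrix of
the MMEs $\tilde\beta_1$, $\tilde\beta_2$ are performed according to (3.9.5) with $U_i$, $V_i$
replaced by $Y_i$, $Z_i$.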
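
The two moment equations for the beta prior also admit a closed-form solution, obtained by first solving for the prior moments $\gamma_1 = E(\Lambda)$, $\gamma_2 = E(\Lambda^2)$ and then converting them to $(\beta_1, \beta_2)$. The following sketch shows one way to do this; the function name and data are hypothetical.

```python
def mm_beta_prior(t, m):
    """Method-of-moments estimates of (beta_1, beta_2) from proportions t_i = t_i*/m_i."""
    n = len(t)
    g1 = sum(t) / n
    mean_t2 = sum(x * x for x in t) / n
    h = sum(1.0 / mi for mi in m) / n
    if h >= 1.0:
        raise ValueError("need at least one m_i greater than 1")
    # From mean(t^2) = g2 * (1 - h) + g1 * h.
    g2 = (mean_t2 - g1 * h) / (1.0 - h)
    prior_var = g2 - g1 ** 2
    if prior_var <= 0 or g2 >= g1:
        raise ValueError("moment equations have no admissible solution for these data")
    s = (g1 - g2) / prior_var        # s = beta_1 + beta_2
    return g1 * s, (1.0 - g1) * s

t_star = [1, 12, 3, 1, 9]            # hypothetical success counts
m = [10, 20, 5, 8, 12]
t = [r / mi for r, mi in zip(t_star, m)]
b1, b2 = mm_beta_prior(t, m)
```

As in the Poisson case, data with too little between-group spread give no admissible solution, and the sketch simply reports this rather than substituting a boundary value.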


(b) $G_k$ approximation of G

The details of this procedure run parallel to those for the Poisson 
distribution with the obvious modifications to (3.8.11) and the 
analogue of (3.8.5). 


(c) Linear EB 

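Refer to the method described in section 3.8.4 and note that

\[
E(T \mid \lambda) = \lambda, \qquad E(T^2 \mid \lambda) = \lambda^2 + \lambda(1 - \lambda)/m
\]

so that (3.8.4) becomes

\[
\begin{bmatrix} 1 & \gamma_1 \\ \gamma_1 & \gamma_2 + (\gamma_1 - \gamma_2)/m \end{bmatrix}
\begin{bmatrix} w_0 \\ w_1 \end{bmatrix}
=
\begin{bmatrix} \gamma_1 \\ \gamma_2 \end{bmatrix},
\]

where $\gamma_r = E(\Lambda^r)$, $r = 1, 2$. Estimates $\hat\gamma_r$ of $\gamma_r$, $r = 1, 2$ are obtained by
solving

\[
\sum_{i=1}^{n} Y_i^*(\gamma_1, \gamma_2) = \sum_{i=1}^{n} (t_i - \gamma_1) = 0,
\]

\[
\sum_{i=1}^{n} Z_i^*(\gamma_1, \gamma_2) = \sum_{i=1}^{n} \{t_i^2 - \gamma_2 - (\gamma_1 - \gamma_2)/m_i\} = 0.
\]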


Estimates of the variances and covariance of the $\hat\gamma_r$ and hence of
the $\hat{w}$'s are again obtainable through (3.9.5).

3.9.3 The normal distribution

We shall suppose that the distribution of $X$ is $N(\lambda, \sigma^2)$ with $\sigma^2$ fixed
and known. Since $m_i \ge 1$ it would be possible to compute estimates of
an unknown $\sigma^2$; see also section 7.3.2(b). In the parametric G case the
prior distribution of $\Lambda$ will be assumed $N(\mu_G, \sigma_G^2)$. For convenience
below we shall write $\sigma_G^2 = v_G$. Details for other parametric G
distributions will be similar, but usually more complicated
algebraically.


(a) Parametric G: $N(\mu_G, v_G)$: ML and MM estimation

The marginal distribution of $T$ is $N(\mu_G, v_G + \sigma^2/m)$ hence it follows
that the MLEs of $\mu_G$ and $v_G$ are the solutions of

\[
\sum_{i=1}^{n} U_{1i}(\mu_G, v_G) = \sum_{i=1}^{n} \frac{t_i - \mu_G}{v_G + \sigma^2/m_i} = 0,
\]

\[
\sum_{i=1}^{n} U_{2i}(\mu_G, v_G) = \sum_{i=1}^{n} \frac{1}{v_G + \sigma^2/m_i} \left\{ -1 + \frac{(t_i - \mu_G)^2}{v_G + \sigma^2/m_i} \right\} = 0.
\]

The solution is helped by noting that

\[
\hat\mu_G = \sum_{i=1}^{n} \frac{t_i}{\hat{v}_G + \sigma^2/m_i} \Big/ \sum_{i=1}^{n} \frac{1}{\hat{v}_G + \sigma^2/m_i}.
\]

In this case, also, the derivatives needed in (3.9.5) are much easier to
calculate than for the Poisson and binomial distributions.

For MM estimation we note that

\[
E(T \mid \lambda) = \lambda; \qquad E(T) = \mu_G,
\]

\[
E(T^2 \mid \lambda) = \lambda^2 + \sigma^2/m; \qquad E(T^2) = \mu_G^2 + v_G + \sigma^2/m,
\]

so that we estimate $\mu_G$ and $v_G$ by solving

\[
\sum_{i=1}^{n} Y_i(\mu_G, v_G) = \sum_{i=1}^{n} (t_i - \mu_G) = 0,
\]

\[
\sum_{i=1}^{n} Z_i(\mu_G, v_G) = \sum_{i=1}^{n} (t_i^2 - \mu_G^2 - v_G - \sigma^2/m_i) = 0,
\]

giving

\[
\tilde\mu_G = \bar{t} = (1/n) \sum_{i=1}^{n} t_i, \qquad
\tilde{v}_G = \sum_{i=1}^{n} (t_i^2 - \bar{t}^2 - \sigma^2/m_i)/n.
\]

In theory we should take $\tilde{v}_G = \max\{0, \sum_{i=1}^{n} (t_i^2 - \bar{t}^2 - \sigma^2/m_i)/n\}$; with
large $n$ the truncation should rarely become operative.
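
The MM estimates for the normal case, including the truncation at zero, and the weighted-mean relation that helps the ML solution, can be sketched as follows. Function names and data are hypothetical.

```python
def mm_normal_prior(t, m, sigma2):
    """MM estimates of (mu_G, v_G) with the truncation max{0, .} applied to v_G."""
    n = len(t)
    mu = sum(t) / n
    v = sum(x * x - mu * mu - sigma2 / mi for x, mi in zip(t, m)) / n
    return mu, max(0.0, v)

def ml_mean_given_variance(t, m, sigma2, v):
    """Weighted mean satisfying the first ML equation for a given value of v_G."""
    w = [1.0 / (v + sigma2 / mi) for mi in m]
    return sum(wi * x for wi, x in zip(w, t)) / sum(w)

t = [1.0, 3.0, 2.0, 6.0]   # hypothetical past means
m = [2, 4, 1, 2]
mu_mm, v_mm = mm_normal_prior(t, m, sigma2=1.0)
mu_ml = ml_mean_given_variance(t, m, sigma2=1.0, v=v_mm)
```

An ML fit would alternate between the weighted mean and a one-dimensional search over $v_G$; only the first of these two steps is shown here.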

(b) Approximation of G by $G_k$

The details are essentially the same as for the parametric G case, but
they are computationally more complicated. Some simplification can
be effected by assuming G to be symmetrical. If $k = 3$ this means that
we have three equally spaced $\lambda_j$ values to estimate.


(c) Linear EB 
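Equation (3.8.4) becomes

\[
\begin{bmatrix} 1 & \gamma_1 \\ \gamma_1 & \gamma_2 + \sigma^2/m \end{bmatrix}
\begin{bmatrix} w_0 \\ w_1 \end{bmatrix}
=
\begin{bmatrix} \gamma_1 \\ \gamma_2 \end{bmatrix},
\]

where $\gamma_r = E(\Lambda^r)$, $r = 1, 2$. Estimates of $\gamma_1$ and $\gamma_2$ can be obtained as
the solutions of

\[
\sum_{i=1}^{n} Y_i^*(\gamma_1, \gamma_2) = \sum_{i=1}^{n} (t_i - \gamma_1) = 0,
\]

\[
\sum_{i=1}^{n} Z_i^*(\gamma_1, \gamma_2) = \sum_{i=1}^{n} (t_i^2 - \gamma_2 - \sigma^2/m_i) = 0,
\]

giving

\[
\hat\gamma_1 = \bar{t}; \qquad
\hat\gamma_2 = \overline{t^2} - (1/n)\, \sigma^2 \sum_{i=1}^{n} (1/m_i) = \overline{t^2} - \sigma^2\, \overline{(1/m)}
\]

and

\[
\hat{w}_0 = \frac{\hat\gamma_1 \sigma^2/m}{\hat\gamma_2 - \hat\gamma_1^2 + \sigma^2/m}, \qquad
\hat{w}_1 = \frac{\hat\gamma_2 - \hat\gamma_1^2}{\hat\gamma_2 - \hat\gamma_1^2 + \sigma^2/m}.
\]

The covariance matrix of $\hat\gamma_1$, $\hat\gamma_2$ can be estimated according to (3.9.5)
by substituting $Y_i^*$, $Z_i^*$ for $U_i$, $V_i$, etc.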


3.10 Nonparametric EB estimation 

When we have multiple past and current observations an argument 
put forward by Johns (1957) shows that it is possible to develop yet 
another type of EB estimator. It has two great merits. First, G need 
not be estimated explicitly. Second, it is nonparametric in the sense 
that the exact mathematical form of $f(x \mid \lambda)$ need not be known. A
disadvantage is the difficulty of application when $X$ is continuous.

We first consider the simplest case where Johns's method can be
applied. Let there be $n$ pairs of past observations, and one current
observation, denoted as follows:

    Past                                   Current
    $x_{11}, x_{21}, \ldots, x_{n1}$         $x$
    $x_{12}, x_{22}, \ldots, x_{n2}$

The two observations $(x_{i1}, x_{i2})$ are two independent realizations of
the r.v. $X$, which has the d.f. $F(x \mid \lambda_i)$, $i = 1, 2, \ldots, n$. We assume that
$E(X \mid \lambda) = \lambda$; re-parametrization can be carried out if necessary. Now,
every pair of past observations can also be looked upon as
realizations of two independent r.v.s $X_1$ and $X_2$, each of which has the
d.f. $F(x \mid \lambda)$ for a certain value of $\lambda$. Hence the joint density of $X_1$, $X_2$
and $\Lambda$ is $f(x_1 \mid \lambda) f(x_2 \mid \lambda)\, dG(\lambda)$, and

\[
E(X_2 \mid X_1 = x)
= \frac{\int\!\!\int x_2\, dF(x_2 \mid \lambda)\, f(x \mid \lambda)\, dG(\lambda)}{\int\!\!\int dF(x_2 \mid \lambda)\, f(x \mid \lambda)\, dG(\lambda)}
= \frac{\int \lambda f(x \mid \lambda)\, dG(\lambda)}{\int f(x \mid \lambda)\, dG(\lambda)}
= \delta_G(x). \tag{3.10.1}
\]


Equation (3.10.1) suggests that, in the case of a discrete $X$, we can
obtain an estimate of the Bayes estimator, i.e., an EB estimator as
follows: first select all those pairs $(x_{i1}, x_{i2})$ in which $x_{i1} = x$. Let the
number of these pairs be $n_x$, and denote the $x_{i2}$ values for which
$x_{i1} = x$ by $x_{j2}(x)$, $j = 1, 2, \ldots, n_x$. Then, if $n_x$ is greater than 0, the
EB estimator is the mean of the $x_{j2}(x)$. Formally, the EB estimator
$\delta^*(x)$ is defined by


\[
\delta^*(x) =
\begin{cases}
\dfrac{1}{n_x} \sum_{j=1}^{n_x} x_{j2}(x), & n_x > 0 \\[1ex]
0, & \text{otherwise.}
\end{cases} \tag{3.10.2}
\]


We observe that, owing to the independence of $X_1$ and $X_2$, every
$\Delta_j = \lambda_j(x) - x_{j2}(x)$, where $\lambda_j(x)$ is the parameter value obtaining when
the observation $x_{j2}(x)$ was generated, may be regarded as an
independent realization of a r.v. with zero mean.
Hence,

\[
\delta^*(x) = \frac{1}{n_x} \sum_{j=1}^{n_x} \lambda_j(x) - \frac{1}{n_x} \sum_{j=1}^{n_x} \Delta_j \to E(\Lambda \mid x) - 0 = \delta_G(x),
\]

in probability, as $n \to \infty$, if $n_x \to \infty$, in probability, and $P(n_x > 0) \to 1$.
The latter conditions are clearly satisfied in all cases of practical
interest.

When $n_x$ is small the estimate $\delta^*(x)$ of $E(\Lambda \mid x)$ defined by equation
(3.10.2) may be too unreliable, and to counter this effect, Johns (1957)
has suggested the following modification:

\[
\delta^*_c(x) =
\begin{cases}
\delta^*(x), & n_x \ge c \\
x, & n_x < c.
\end{cases} \tag{3.10.3}
\]

Thus, for $n_x < c$, a 'conventional' unbiased estimator is used. It is
easily verified that $\delta^*_c(x) \to \delta_G(x)$, (P), as $n \to \infty$.

Another modification concerns the assumption, above, that we 
have one current observation but pairs of past observations. In 
circumstances where EB procedures are likely to be used, it is also 
likely that there will be two current observations, $x_1$ and $x_2$, say. One
way of utilizing both of these is to put

\[
\bar\delta^*(x_1, x_2) = \tfrac{1}{2}\delta^*(x_1) + \tfrac{1}{2}\delta^*(x_2). \tag{3.10.4}
\]

Denoting $\bar\delta^*(x_1, x_2)$ by $\bar\delta$, we have

\[
W(\bar\delta^*(x_1, x_2)) = E(\bar\delta - \Lambda)^2, \tag{3.10.5}
\]

by definition, where $E$ denotes integration w.r.t. $\lambda$, $x_1$, $x_2$. The r.h.s. of
(3.10.5) can be expanded to

\[
E(\Lambda^2) + \tfrac{1}{4}\{E[\delta^*(x_1)]^2 + E[\delta^*(x_2)]^2\} + \tfrac{1}{2}E\{\delta^*(x_1)\,\delta^*(x_2)\}
- E\{\Lambda\,\delta^*(x_1) + \Lambda\,\delta^*(x_2)\}.
\]

Also,

\[
W(\delta^*(x_1)) = E(\Lambda^2) + E[\delta^*(x_1)]^2 - 2E[\Lambda\,\delta^*(x_1)],
\]

and remembering that, since $x_1$ and $x_2$ are identically distributed,

\[
E[\delta^*(x_1)]^2 = E[\delta^*(x_2)]^2, \qquad
E[\Lambda\,\delta^*(x_1)] = E[\Lambda\,\delta^*(x_2)],
\]

we find that

\[
W(\delta^*(x_1)) - W(\bar\delta) = \tfrac{1}{4}E[\delta^*(x_1) - \delta^*(x_2)]^2 \ge 0. \tag{3.10.6}
\]

By the same argument as before,

\[
\bar\delta \to \tfrac{1}{2}(\delta_G(x_1) + \delta_G(x_2)) = \bar\delta_G(x_1, x_2).
\]

In general

\[
\tfrac{1}{2}(\delta_G(x_1) + \delta_G(x_2)) \ne \delta_G(x_1, x_2),
\]

but by the same argument giving (3.10.6),

\[
W(\bar\delta_G(x_1, x_2)) \le W(\delta_G(x_1)).
\]

This result, with (3.10.6), provides some justification for this particular
method of combining both current results in an EB estimator.

Referring to section 3.2, we observe that, since $\delta^*(x) \to \delta_G(x)$, (P), in
the case of a single current observation, $\delta^*(x)$ is a.o. under the
condition (3.2.1). In a like manner, it can be shown that
$W(\bar\delta) \to W(\bar\delta_G(x_1, x_2))$, (P), and $E_n W(\bar\delta) \to W(\bar\delta_G(x_1, x_2))$.

Johns (1957) has computed upper and lower bounds for $E_n W(\delta^*_c(x))$
and $E_n W(\delta^*(x))$ in the case where $f(x \mid \lambda)$ is the Poisson distribution and
$G(\lambda)$ is the $\Gamma$-distribution given in Example 1.3.2 with $\alpha = 2$, $\beta = 10$.
These results are shown in Table 3.7.

Table 3.7 may be compared with the appropriate entries in Table 
3.4. The Johns estimators appear to be markedly better than the 
Robbins estimators, but it must be remembered that the Johns 
estimators are based on rather more past data. The Robbins 
estimators are relatively inefficient because cognisance is taken only 
of the frequencies of occurrence of x and x + 1, the individual values of 
other observations not being used directly. The Johns estimators are 
somewhat less wasteful since the actual values of some of the $x_{i2}$ are
used. Another relevant point is that the Johns estimators could 
possibly be improved by smoothing. 


Table 3.7 The nonparametric EB estimator of Johns (1957).
Bounds for $E_n W(\cdot)$ in the case of a Poisson kernel, gamma(2, 10)
prior distribution

            Bounds
    n     Lower    Upper    W(δ_G)    W(T)
    15    10.70    12.51    1.67      5.00
    60     4.46     5.54    1.67      5.00
    120    3.20     3.87    1.67      5.00

The processes described above can be extended in an obvious
manner to the case where we have $r - 1$ independent current
observations and $n$ sets of $r$ independent past observations. For
example, let $r = 3$ and let the current observations be $(x_1, x_2)$, the past
series being $n$ triplets $(x_{i1}, x_{i2}, x_{i3})$, $i = 1, \ldots, n$. By steps similar to
those leading to equation (3.10.1) it can be shown that

\[
E(X_3 \mid X_2 = x_2, X_1 = x_1)
= \frac{\int\!\!\int x_3\, dF(x_3 \mid \lambda)\, f(x_2 \mid \lambda)\, f(x_1 \mid \lambda)\, dG(\lambda)}{\int\!\!\int dF(x_3 \mid \lambda)\, f(x_2 \mid \lambda)\, f(x_1 \mid \lambda)\, dG(\lambda)}
= \delta_G(x_1, x_2). \tag{3.10.7}
\]

If we now select from amongst the past triplets those for which
$(x_{i1}, x_{i2})$ or $(x_{i2}, x_{i1}) = (x_1, x_2)$, and compute the average of the
corresponding $x_{i3}$'s an EB estimator is obtained.

Formally, let

\[
M_i(x_1, x_2) =
\begin{cases}
1, & \text{if one of the permutations of } (x_1, x_2) = (x_{i1}, x_{i2}) \\
0, & \text{otherwise}
\end{cases}
\]

and

\[
M(x_1, x_2) = \sum_{i=1}^{n} M_i(x_1, x_2).
\]

Then the EB estimator is

\[
\delta^*(x_1, x_2) =
\begin{cases}
\dfrac{1}{M(x_1, x_2)} \sum_{i=1}^{n} M_i(x_1, x_2)\, x_{i3}, & M(x_1, x_2) > 0 \\[1ex]
0, & \text{otherwise.}
\end{cases}
\]

The asymptotic properties of $\delta^*(x_1, x_2)$, as $n \to \infty$, are like those of
$\delta^*(x)$. Modifications analogous to those of equations (3.10.3) and
(3.10.4) can be made in an obvious manner. Further, extension of the
definitions to current observational vectors with $r$ elements, and past
vectors with $r + 1$ elements can be readily carried out. Johns (1957)
describes this general case.

Returning to the case of a single current observation, and pairs of
past observations, we introduce another modification to $\delta^*(x)$. Since
the past observations are randomly generated for every parameter
value, there is no reason for matching $x$ with $x_{i1}$ rather than $x_{i2}$.
Therefore, let $(x_{i1}^{(s)}, x_{i2}^{(s)})$, $s = 1, 2$ be the two permutations of $(x_{i1}, x_{i2})$;
we distinguish all permutations, even if the values of $x_{i1}$ and $x_{i2}$ are the
same.

Put

\[
M_{is}(x) =
\begin{cases}
1, & \text{if } x_{i1}^{(s)} = x \\
0, & \text{otherwise,}
\end{cases}
\]

and

\[
M(x) = \sum_{i=1}^{n} \sum_{s=1}^{2} M_{is}(x).
\]

Then define the EB estimator by

\[
\gamma(x) =
\begin{cases}
\dfrac{1}{M(x)} \sum_{i=1}^{n} \sum_{s=1}^{2} M_{is}(x)\, x_{i2}^{(s)}, & \text{for } M(x) > 0 \\[1ex]
x, & \text{otherwise.}
\end{cases}
\]

Example 3.10.1 illustrates the difference between $\delta^*(x)$ and $\gamma(x)$.


Example 3.10.1 Let the past and current observations be:

    Past                    Current
    2  3  8  4  2  0  1     2
    5  4  1  3  2  4  2

Then

\[
\delta^*(2) = (5 + 2)/2 = 7/2, \qquad
\gamma(2) = (5 + 2 + 2 + 1)/4 = 10/4.
\]
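
The two estimators of Example 3.10.1 can be reproduced with a few lines of code. This is a minimal sketch; the function names are hypothetical, and the value returned by the first function when no past pair matches is exposed as a parameter because it is an assumption of this sketch.

```python
def johns_estimate(pairs, x, default=0.0):
    """delta*(x): mean of the second members of pairs whose first member equals x."""
    matched = [x2 for x1, x2 in pairs if x1 == x]
    return sum(matched) / len(matched) if matched else default

def johns_symmetric_estimate(pairs, x):
    """gamma(x): as above, but matching x against either member of every pair."""
    matched = []
    for x1, x2 in pairs:
        if x1 == x:
            matched.append(x2)
        if x2 == x:
            matched.append(x1)
    return sum(matched) / len(matched) if matched else x

pairs = list(zip([2, 3, 8, 4, 2, 0, 1], [5, 4, 1, 3, 2, 4, 2]))
print(johns_estimate(pairs, 2))            # 3.5
print(johns_symmetric_estimate(pairs, 2))  # 2.5
```

The pair (2, 2) contributes twice to the symmetric estimator because both of its permutations match, in line with the convention that all permutations are distinguished.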

The above modification can be easily extended to observational 
vectors with r + 1 and r elements, and alterations for the case of an 
r + 1 element current vector may be made as before. 

Krutchkoff (1967) has proposed an extension of the Johns method
applicable when, subsequent to evaluation of every current observation,
further information about the current parameter becomes
available. It is assumed to take the form of a statistic, $Y$, independent
of $X$, the current observation, such that $E(Y \mid \lambda) = \lambda$. Thus, when the
current observation is $X = x$, the available information is:

    Past                               Current
    $x_1, x_2, \ldots, x_n$              $x$
    $y_1, y_2, \ldots, y_n$

The distribution of $Y$ need not be the same as the distribution of $X$.
Nevertheless, by essentially the same argument as that leading to
equation (3.10.1),

\[
E(Y \mid X = x) = E(\Lambda \mid X = x),
\]

leading to the construction of an EB estimator as before. Put

\[
M_i(x) =
\begin{cases}
1, & \text{when } x_i = x \\
0, & \text{otherwise,}
\end{cases}
\]

and

\[
M(x) = \sum_{i=1}^{n} M_i(x).
\]

Then the EB estimator is

\[
\psi_n(x) =
\begin{cases}
\dfrac{1}{M(x)} \sum_{i=1}^{n} M_i(x)\, y_i, & \text{for } M(x) > 0 \\[1ex]
x, & \text{otherwise.}
\end{cases}
\]

Asymptotic optimality of $\psi_n(x)$ has been established by Krutchkoff
(1967).

There is some difficulty in applying Johns's approach to a continuous
$X$. In theory, the probability of a past value equalling $x$ is zero.
However, if all results are rounded off, or grouped, that is, if the r.v. X 
is effectively ‘discretized’, this difficulty can be overcome (Johns, 
1957). Evidently, problems concerning the coarseness of the grouping 
arise, but Johns has shown that ‘discretizing’ can be carried out such 
that asymptotically satisfactory results are obtainable. 

For any finite past sample the nonparametric EB estimators 
defined are clearly ‘non-smooth’ functions of x, in general. They share 
this property with the Robbins EB estimators for the Poisson 
distribution and other members of the exponential family of distributions.
The possibility of direct smoothing of such estimators was
discussed in section 3.4.5. Justification of smoothing by polynomial
fitting, or similar devices, may be sought in the fact that substitution
of a parametric prior d.f., $G$, usually leads to a 'well-behaved' function,
$\delta_G(x)$, as the Bayes estimator.

Krutchkoff (1967) has considered smoothing of the 'supplementary
sample' nonparametric EB estimators by fitting a straight line. If
$E(\Lambda \mid x) = \alpha + \beta x$, then, from (3.10.1), $E(Y \mid x) = \alpha + \beta x$, and the proposed
procedure is to estimate $\alpha$ and $\beta$ by the usual method of least
squares, treating the $x_i$ as controlled variables, and the $y_i$ as
observations subject to error.

When smoothing any EB estimators, the argument used in
section 3.4.5 suggests the use of a form of weighted least squares
fitting, with weights proportional to $p_G(x)$, when $X$ is discrete. In
practice the $p_G(x)$ will be unknown, but estimates directly proportional
to the $M(x)$ or $n_x$ can be formed. Using these estimated weights
will, in Krutchkoff's case, lead to the same procedure which he
advocates. The choice of a smoothing function remains open, but it
should be influenced by knowledge of certain general characteristics
of the Bayes estimator. Thus, in examples of the type discussed above
it would be appropriate to impose constraints which make the
function non-decreasing.
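
The remark that weighting by $M(x)$ reproduces the straight-line fit to the raw pairs can be illustrated numerically. The sketch below fits a line to hypothetical $(x_i, y_i)$ data by ordinary least squares, then fits a line to the unsmoothed estimates $\psi_n(x)$ (the group means of the $y_i$) with weights $M(x)$, and the two fits agree. Function names and data are assumptions made for illustration.

```python
def fit_line(x, y, w=None):
    """(Weighted) least squares fit of y = a + b x; returns (a, b)."""
    if w is None:
        w = [1.0] * len(x)
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def group_means(x, y):
    """Distinct x values, the mean of y at each (psi_n), and the counts M(x)."""
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    xs = sorted(groups)
    return xs, [sum(groups[v]) / len(groups[v]) for v in xs], [len(groups[v]) for v in xs]

x = [0, 1, 1, 2, 2, 2, 3]                      # hypothetical past x_i
y = [0.5, 1.0, 2.0, 1.5, 2.5, 3.5, 3.0]        # hypothetical supplementary y_i
a_raw, b_raw = fit_line(x, y)
xs, psi, counts = group_means(x, y)
a_wls, b_wls = fit_line(xs, psi, counts)
```

The agreement is exact because the two sums of squares differ only by the within-group variation of the $y_i$, which does not involve the line.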

Additionally the parameter may be restricted to a certain known 
interval. The geometric distribution (cf. Example 1.3.4) demands 
attention to both types of constraints. Krutchkoff (1967) has given the 
following example, which shows that smoothing can be effective, even 
when the fitted function is not strictly appropriate: the kernel 
distribution is binomial with index parameter n = 5, and the prior 
distribution is normal with mean 0.2 and s.d. 0.05. The best unbiased 
estimator $T = X/5$ has $W(T) = 0.0315$, while $W(\delta_G) = 0.0023$. With 
n = 30 past observations, $W(\psi_n) = 0.0060$. Smoothing by fitting a straight line gives 
$\psi^*(x) = a + bx$, and with n = 30, $W(\psi^*) = 0.0037$. 

3.11 Assessing performance of EBEs in practice 

Many studies of the performance of EBEs have been reported, some 
of them to do with finite and relatively small numbers, n, of past 
realizations of $\Lambda$, others with asymptotic properties as $n \to \infty$. On the 
whole they show that EBEs can perform very well by comparison 
with non-Bayes estimators. The relative goodness of EBEs depends 
largely on the dispersion of the prior distribution, and on n. These 
studies do not, however, give a direct answer to the following question 
that arises in every particular potential application of EB estimation. 
Is the expected performance of the EBE better than that of the MLE 
for these conditions and these past data? In order to answer such a 
question one needs estimates of W(Bayes), W(MLE), $E_n W$(EB), at 
least. 

In this section we shall discuss estimation of the W values needed to 
make a judgement in practice of the relative merits of EB and other 
estimation methods. The most tractable solutions of the problem are 
obtainable in the parametric G situation, and in linear EB estimation. 

3.11.1 Parametric G, single past and current observations, 
$m_i = m = 1$ 

If G is known to have parametric form $G(\lambda; \alpha)$, depending on 
parameters $\alpha$, the mixed X-distribution is $F(x; \alpha)$. From the past 
observations $x_1, x_2, \ldots, x_n$ it is possible to estimate $\alpha$ by one of the 
standard methods, for example maximum likelihood. Suppose 
W(MLE), $E_n W$(EB) are functions of $\alpha$, and that they are estimated by 
the same functions of an estimate $\tilde\alpha$. In particular $\tilde\alpha$ may be $\hat\alpha$, the MLE of $\alpha$. 

In order to decide, say, between EB and ML estimation we may 
need the estimate $\hat W$(MLE) of W(MLE) and its standard error, and 
similarly for the other estimation methods. Refinements might be to 
ask for estimates of W(ML) - W(Bayes), with standard errors, and so 
on. Now, in principle estimation of W(ML), W(Bayes) and W( ) for 
any other non-Bayes method is straightforward. Estimation of 
$E_n W$(EB) is less straightforward, and we shall give some illustrations 
in the following examples. 

(a) Poisson data distribution, gamma prior G 

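Suppose that the prior gamma density is written as in Example 1.3.7, 

$$g(\lambda) = \alpha^{\beta}\lambda^{\beta-1}e^{-\alpha\lambda}/\Gamma(\beta), \quad \lambda > 0.$$

Then the MLEs of $\alpha$ and $\beta$ from the past data are the solutions of 

$$\frac{\partial \ln L}{\partial \alpha} = \frac{n\beta}{\alpha} - \sum_{i=1}^{n}\frac{\beta + x_i}{\alpha + 1} = 0$$

$$\frac{\partial \ln L}{\partial \beta} = n\ln\alpha - n\ln(\alpha+1) + \sum_{i=1}^{n}\psi(\beta + x_i) - n\psi(\beta) = 0;$$

these equations are the special versions of (3.9.3) with $m_i = 1$. We shall 
follow the steps in section 3.9.1. Hence we put 

$$U_i(a,b) = b/a - (b + x_i)/(a+1)$$
$$V_i(a,b) = \ln a - \ln(a+1) + \psi(b + x_i) - \psi(b)$$

and obtain 

$$\partial U_i/\partial a = -b/a^2 + (b + x_i)/(a+1)^2$$
$$\partial U_i/\partial b = 1/a - 1/(a+1)$$
$$\partial V_i/\partial a = 1/a - 1/(a+1)$$
$$\partial V_i/\partial b = \psi'(b + x_i) - \psi'(b).$$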

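Using these results we can obtain an estimated covariance matrix for 
the MLEs $\hat\alpha$, $\hat\beta$. 

For the EBE based on $\hat\alpha$, $\hat\beta$ we have 

$$W(\text{EB}) - W(\text{Bayes}) = \left\{\frac{\hat\beta + \beta/\alpha}{\hat\alpha + 1} - \frac{\beta + \beta/\alpha}{\alpha + 1}\right\}^2 + \frac{(\hat\alpha - \alpha)^2\beta}{(\hat\alpha+1)^2(\alpha+1)\alpha^2}. \tag{3.11.1}$$

A fairly simple approximation for $E_n W$(EB) - W(Bayes) is obtained 
by replacing $\hat\alpha$ in the denominators in (3.11.1) by $\alpha$ before finally 
taking expectations. The result is 

$$E_n W(\text{EB}) - W(\text{Bayes}) \simeq \frac{\text{var}(\hat\beta) + (\beta/\alpha)^2\,\text{var}(\hat\alpha) - 2(\beta/\alpha)\,\text{cov}(\hat\alpha,\hat\beta)}{(\alpha+1)^2} + \frac{\text{var}(\hat\alpha)\,\beta}{(\alpha+1)^3\alpha^2}. \tag{3.11.2}$$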


Example 3.11.1 We take the n = 50 observations to be those 
summarized in the column headed $f_n(x)$ of Table 3.2. Straightforward 
calculations give the estimated matrices A and V as 

$$A = \begin{pmatrix} -82.575 & 16.53 \\ 16.53 & -3.4189 \end{pmatrix} \quad\text{and}\quad V = 50\begin{pmatrix} 1.7466 & -0.3528 \\ -0.3528 & 0.0725 \end{pmatrix},$$

giving the estimated covariance matrix of $\hat\alpha$, $\hat\beta$ as 

$$\hat C = \begin{pmatrix} 0.21 & 1.01 \\ 1.01 & 5.15 \end{pmatrix}.$$

In the calculations above we used the MLEs $\hat\alpha = 1.310$, $\hat\beta = 6.548$. 
The estimates of W(Bayes), W(ML) are 

$$\hat W(\text{Bayes}) = 2.16, \qquad \hat W(\text{ML}) = \hat\beta/\hat\alpha = 5.00.$$

From formula (3.11.2) the estimate of $E_n W$(EB) - W(Bayes) is 0.12. 
Hence 

$$\hat E_n W(\text{EB}) = 2.28.$$

These results suggest that EB estimation should be preferred to ML in 
conditions generating the observed data. Recall that the particular set 
of data was generated with $\alpha = 2$, $\beta = 10$, a case studied in some detail 
in section 3.7.1. The estimate $\hat E_n W(\text{EB}) = 2.28$ agrees quite well with 
the $E_n W$(EB) values reported in Table 3.4. 
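
The arithmetic of Example 3.11.1 can be retraced with a short Python sketch. It assumes the gamma prior has rate $\alpha$ and shape $\beta$, so that the Bayes risk is $\beta/\{\alpha(\alpha+1)\}$; the function name `eb_excess_risk` is hypothetical, and the inputs are the rounded values quoted in the example, so agreement with the quoted results is only to within rounding.

```python
def eb_excess_risk(alpha, beta, var_alpha, var_beta, cov_ab):
    """Approximation (3.11.2) to E_n W(EB) - W(Bayes), Poisson-gamma case."""
    ratio = beta / alpha
    first = (var_beta + ratio ** 2 * var_alpha
             - 2.0 * ratio * cov_ab) / (alpha + 1.0) ** 2
    second = var_alpha * beta / ((alpha + 1.0) ** 3 * alpha ** 2)
    return first + second


alpha_hat, beta_hat = 1.310, 6.548       # MLEs quoted in Example 3.11.1
var_a, cov_ab, var_b = 0.21, 1.01, 5.15  # rounded estimated covariance matrix

w_ml = beta_hat / alpha_hat              # expected loss of the MLE, E(Lambda)
# Assumed form of the Bayes risk under a gamma(shape beta, rate alpha) prior.
w_bayes = beta_hat / (alpha_hat * (alpha_hat + 1.0))
excess = eb_excess_risk(alpha_hat, beta_hat, var_a, var_b, cov_ab)
en_w_eb = w_bayes + excess

print(w_ml, w_bayes, excess, en_w_eb)
```

Because the first term of (3.11.2) is a small difference of larger quantities, the result is sensitive to rounding of the covariance entries; the sketch is a check on the formula, not a replacement for the full calculation.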

The estimates of W(Bayes), W(ML), $E_n W$(EB) obtained as in 
Example 3.11.1 have standard errors which could be calculated if a 
more formal decision between EB and ML is to be made. To obtain 
the s.e.s of the estimates of W(Bayes) and W(ML) is straightforward; 
for $E_n W$(EB) it can be more complicated. However every estimate of 
these quantities can, in some circumstances, be expressed in terms of 
$\hat\alpha$, $\hat\beta$, so that an estimate of the s.e. can be calculated. The circumstance 
in which this can be done quite easily is where it is appropriate to 
express C in (3.9.5) as the inverse information matrix, whose elements 
are functions of $\alpha$ and $\beta$. Then var($\hat\alpha$), etc., in (3.11.2) can be replaced 
by these expressions, and $E_n W$(EB) - W(Bayes) can be expressed as a 
function $H_n(\alpha, \beta)$, estimated by $H_n(\hat\alpha, \hat\beta)$ whose s.e. can be estimated in 
the obvious way. This procedure will be illustrated in the case of the 
normal data distribution for which the actual manipulations are less 
complicated. 


(b) Normal data distribution, normal prior G 
Following the notation of section 3.7.3 we estimate $\mu_G$ by $\bar x$, the mean 
of the previous observations, and $\sigma_G^2 + 1$ by $s^2$, the sample variance of 
the previous observations. For the present purpose we shall take 
$\hat\sigma_G^2 = s^2 - 1$, not $\max(0, s^2 - 1)$, since this simplifies calculations, 
and introduces no complications to do with $\hat\delta_G$, $W(\hat\delta_G)$ or $E_n W(\hat\delta_G)$. 

The EB estimate is 

$$\hat\delta_G(x) = \{x(s^2 - 1) + \bar x\}/s^2$$

and from (3.5.3) 

$$E_n W(\hat\delta_G) - W(\delta_G) = \left\{\frac{n-1}{n(n-3)} + \frac{2n+6}{(n-3)(n-5)}\right\}\Big/(1 + \sigma_G^2),$$

$$W(\hat\delta_G) = \{\sigma_G^2 + (\bar x - \mu_G)^2 + (s^2 - 1)^2\}/s^4.$$

The estimate of $E_n W$(EB) - W(Bayes) is, therefore, 

$$D(\text{EB}) = \left\{\frac{n-1}{n(n-3)} + \frac{2n+6}{(n-3)(n-5)}\right\}\frac{1}{s^2},$$

and the estimate of W(ML) - W(Bayes) is $D(\text{ML}) = 1/s^2$. 

We might choose between EB and ML on the basis of 
D(ML, EB) = D(ML) - D(EB) for which we have 

$$E\{D(\text{ML, EB})\} = \left\{1 - \frac{n-1}{n(n-3)} - \frac{2n+6}{(n-3)(n-5)}\right\}\left(\frac{n-1}{n-3}\right)\frac{1}{(1+\sigma_G^2)},$$

$$\text{var}\{D(\text{ML, EB})\} = \left\{1 - \frac{n-1}{n(n-3)} - \frac{2n+6}{(n-3)(n-5)}\right\}^2\left\{\frac{(n-1)^2}{(n-3)(n-5)} - \frac{(n-1)^2}{(n-3)^2}\right\}\frac{1}{(1+\sigma_G^2)^2}.$$

Example 3.11.2 Suppose that n = 20 past observations give the 
result $s^2 = 1.32$. Then the estimated value of E{D(ML, EB)} is 0.65 
with estimated s.e. = 0.26. Such a result would indicate superiority of 
EB over ML. 
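
The point estimate in Example 3.11.2 can be retraced as follows. This is a sketch under assumptions: the bracketed constant is taken to be $(n-1)/\{n(n-3)\} + (2n+6)/\{(n-3)(n-5)\}$, and $1/(1+\sigma_G^2)$ is estimated by plugging in $1/s^2$. The function name `d_ml_minus_eb` is hypothetical, and the standard error is deliberately not computed here.

```python
def d_ml_minus_eb(n, s2):
    """Return D(ML), D(EB) and the plug-in estimate of E{D(ML, EB)}.

    n  : number of past observations (n > 5)
    s2 : sample variance of the past observations, estimating 1 + sigma_G^2
    """
    c = (n - 1) / (n * (n - 3)) + (2 * n + 6) / ((n - 3) * (n - 5))
    d_ml = 1.0 / s2
    d_eb = c / s2
    # E(1/s^2) = (n - 1) / {(n - 3)(1 + sigma_G^2)} for normal data;
    # 1/(1 + sigma_G^2) is then replaced by 1/s^2 (an assumed plug-in step).
    expected_diff = (1.0 - c) * (n - 1) / (n - 3) / s2
    return d_ml, d_eb, expected_diff


d_ml, d_eb, expected_diff = d_ml_minus_eb(n=20, s2=1.32)
print(round(expected_diff, 2))
```

A positive value that is large relative to its standard error favours EB over ML, which is the reading given in the example.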


3.11.2 Linear EB: $m_i = m = 1$ 


(a) Poisson data distribution 

Following section 3.9.1(d) with $m_i = 1$, $t = x$, the linear EB estimate is 

$$\delta(x; \hat w_0, \hat w_1) = \hat w_0 + \hat w_1 x$$

where 

$$\hat w_0 = (\bar x)^2/\{\overline{x^2} - (\bar x)^2\}$$

$$\hat w_1 = \{\overline{x^2} - (\bar x)^2 - \bar x\}/\{\overline{x^2} - (\bar x)^2\}.$$

Large n approximations for $E_n W$(linear EB) can be obtained by 
taking $\hat w_0$, $\hat w_1$ to be approximately unbiased for $w_0$, $w_1$, giving 

$$E_n W(\text{linear EB}) - W(\text{linear Bayes}) \simeq \text{var}(\hat w_0) + (\gamma_1 + \gamma_2)\,\text{var}(\hat w_1) + 2\gamma_1\,\text{cov}(\hat w_0, \hat w_1).$$

Approximate values for var($\hat w_0$), var($\hat w_1$), cov($\hat w_0$, $\hat w_1$) can be 
obtained as suggested in section 3.9.1(d) or directly by using standard 
formulae for variances and covariances of moments. 

Also 

$$W(\text{linear Bayes}) = w_0^2 + (w_1 - 1)^2\gamma_2 + \{2(w_1 - 1)w_0 + w_1^2\}\gamma_1$$

where $\gamma_r = E(\Lambda^r)$, r = 1, 2, and $w_0$, $w_1$ can be expressed in terms of 
$\gamma_1$, $\gamma_2$. An estimate of W(linear Bayes) is obtained by noting that 
estimates of $\gamma_1$, $\gamma_2$ are $\hat\gamma_1 = \bar x$, $\hat\gamma_2 = \overline{x^2} - \bar x$. 
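
A minimal Python sketch of the linear EB calculation for Poisson data follows. The function names and the data are hypothetical. One assumption goes beyond the text: when the sample variance does not exceed the sample mean, the sketch falls back to the constant estimate $\bar x$, since the moment estimate of var($\Lambda$) would otherwise be negative.

```python
def linear_eb_poisson(past):
    """Return (w0_hat, w1_hat) from past Poisson observations, m_i = 1."""
    n = len(past)
    m1 = sum(past) / n                      # x-bar, estimates gamma_1
    m2 = sum(x * x for x in past) / n       # mean of x^2
    v = m2 - m1 ** 2                        # sample variance (divisor n)
    if v <= m1:
        # Assumed convention: no over-dispersion, prior treated as degenerate.
        return m1, 0.0
    return m1 ** 2 / v, (v - m1) / v


def w_linear_bayes_hat(past):
    """Plug-in estimate of W(linear Bayes) using gamma_1, gamma_2 estimates."""
    n = len(past)
    g1 = sum(past) / n
    g2 = sum(x * x for x in past) / n - g1  # estimates E(Lambda^2)
    w0, w1 = linear_eb_poisson(past)
    return w0 ** 2 + (w1 - 1) ** 2 * g2 + (2 * (w1 - 1) * w0 + w1 ** 2) * g1


past = [2, 7, 4, 9, 1, 6, 12, 3, 5, 8]      # hypothetical past counts
w0_hat, w1_hat = linear_eb_poisson(past)
estimate_for_x = w0_hat + w1_hat * 10       # linear EB estimate at x = 10
risk_hat = w_linear_bayes_hat(past)
```

The fitted line passes through $(\bar x, \bar x)$, so the estimate shrinks each observation toward the mean of the past data by the factor $\hat w_1$.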


(b) The normal data distribution 

Taking the known $\sigma^2 = 1$ in section 3.9.3(c), and $m_i = m = 1$ the linear 
EBE becomes 

$$\delta(x; \hat w_0, \hat w_1) = \hat w_0 + \hat w_1 x$$

with 

$$\hat w_0 = \bar x/s^2$$
$$\hat w_1 = (s^2 - 1)/s^2.$$

We can express $E_n W$(linear EB) - W(linear Bayes) as 

$$E_n\left[\frac{\{\bar x - E(X_G)\}^2}{s^4} + \text{var}(X_G)\left\{\frac{1}{s^2} - \frac{1}{\text{var}(X_G)}\right\}^2\right].$$

A crude but useful approximation is obtained by putting $s^2$ in the 
denominators of the expression above equal to var($X_G$), giving 

$$\left\{1 + [\mu_4(X_G) - \mu_2^2(X_G)]/\mu_2^2(X_G)\right\}/\{n\mu_2(X_G)\}$$

where $\mu_r(X_G)$ are the central moments of the marginal X-distribution. 
These can be estimated directly from the past observations. 



CHAPTER 4 


Empirical Bayes point estimation: 
vector parameters 


4.1 Introduction 

Most realistic estimation problems involve vector parameters. In 
univariate studies the simplest problems with more than one 
parameter tend to be those of location and scale, but many others 
occur where the data distribution F depends on more than one 
parameter. Multivariate data distributions usually have vector 
parameters of dimension greater than one. 

The applications of EB methods discussed in Chapter 8 illustrate 
the statements made above. They also indicate that interest often 
centres on just one parameter or one function of several parameters. 
Keeping in mind that the general idea behind the use of EB methods is 
to improve the precision of individual estimates by using past results, 
one may well decide to apply an EB approach only to parameters of 
primary concern. We shall discuss this in more detail for the location- 
scale problem. 

Vector parameter problems occur naturally in linear regression. In 
the usual general linear model it is reasonable to assume that the 
covariance matrix of the estimates of the parameters of interest is 
known apart from a multiplicative constant, the residual variance. 
This leads to a special class of EB problems, and in particular, those 
involving the multivariate normal distribution with known 
covariance matrix. 

4.2 Location-scale estimation 

We deal here with data distributions of the form $F\{(x - \lambda)/\sigma\}$; $\lambda$ and $\sigma$ 
are, respectively, the location and scale parameters. Generally we may 
consider past random samples of sizes $m_i$, i = 1, 2, ..., n, obtained with 
past realizations $(\lambda_i, \sigma_i)$ of r.v.s $(\Lambda, \Sigma)$, from populations of the form F. 
However, a rather common type of statistical model is one in which $\sigma$ 
does not vary, i.e. $\Sigma$ is actually a constant $\sigma$. As an example, one can 
consider the one-way analysis of variance model with random 
location effects. 


4.2.1 Fixed unknown $\sigma$ 

Every past sample with $m_i > 1$ provides an estimate of $\sigma$, and the 
natural steps are to pool these estimates into one estimate $\hat\sigma$ of $\sigma$. Then 
we replace $\sigma$ by $\hat\sigma$ in whatever formulae arise for Bayes or EB 
estimates. Apart from this obvious change the methods for obtaining 
EB estimates are exactly like those described in Chapter 3. 


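Example 4.2.1 $F\{(x - \lambda)/\sigma\}$ is the $N(\lambda, \sigma^2)$ distribution, $m_i = m > 1$, 
the distribution of $\Lambda$ is $N(\mu_G, \sigma_G^2)$. Then $\hat\sigma^2$ is the usual pooled sample 
variance, $\hat\sigma^2/m + \hat\sigma_G^2 = s^2$, the sample variance of the estimates $\bar x_i$ of 
the individual $\lambda_i$, and the EBE of the current $\lambda$ is 

$$\hat\delta_G(\bar x) = (m\bar x\hat\sigma_G^2 + \hat\mu_G\hat\sigma^2)/(ms^2).$$

Calculations like those giving (3.5.2) lead to 

$$E_n W(\text{EB}) = \frac{(m\sigma_G^2 + \sigma^2)}{nm}E\left(\frac{\hat\sigma^2}{ms^2}\right)^2 + \frac{(m\sigma_G^2 + \sigma^2)}{m}E\left\{\frac{\sigma^2}{(m\sigma_G^2 + \sigma^2)} - \frac{\hat\sigma^2}{ms^2}\right\}^2 + W(\text{Bayes}),$$

which can be simplified somewhat by steps like those giving (3.5.3). 
The result is 

$$E_n W(\text{EB}) \simeq \left[\frac{(n-1)^2}{n(n-3)(n-5)} + \frac{n(m-1)+2}{n(m-1)}\,\frac{(n-1)^2}{(n-3)(n-5)} - \frac{2(n-1)}{(n-3)} + 1\right]\frac{\sigma^4}{m^2(\sigma_G^2 + \sigma^2/m)} + W(\text{Bayes}).$$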


The details for non-normal $F\{(x - \lambda)/\sigma\}$ are more complicated if 
one attempts to derive an EBE which is an estimate of the actual 
Bayes estimate. Two compromises which have been discussed before 
are: 

1. reducing the m sample values to the MLE $\hat\lambda$ of $\lambda$, treating $\hat\lambda$ as being 
normally distributed, i.e. $N(\lambda, c\sigma^2/m)$, where c is a constant 
determined by the form of F; 

2. adopting a linear Bayes approach. 


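4.2.2 Fixed unknown $\sigma$: linear Bayes 

Suppose that $\hat\lambda$ is an estimate of $\lambda$, unbiased with var($\hat\lambda$) = $k\sigma^2/m$; 
the constant k, like c above, is determined by the form of F and the 
type of estimate $\hat\lambda$. Then a linear Bayes estimate, based on $\hat\lambda$, is 
$\delta(\hat\lambda; w_0, w_1) = w_0 + w_1\hat\lambda$ with $(w_0, w_1)$ given by 

$$\begin{bmatrix} 1 & E(\Lambda) \\ E(\Lambda) & E(\Lambda^2) + k\sigma^2/m \end{bmatrix}\begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = \begin{bmatrix} E(\Lambda) \\ E(\Lambda^2) \end{bmatrix}; \tag{4.2.1}$$

see also (1.12.2). 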

An empirical version of the linear Bayes estimate is obtained simply 
by replacing the elements of the l.h.s. matrix and r.h.s. vector in (4.2.1) 
by estimates derived from the past observations. Estimation of $\sigma$ can 
be carried out in diverse ways, but generally one would take the final 
estimate $\hat\sigma$ to be a pooled version of estimates derived from the n past 
samples, each of size m. 


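Example 4.2.2 Let $(1/\sigma)f\{(x - \lambda)/\sigma\} = (1/2\sigma)\exp(-|(x - \lambda)/\sigma|)$, 
$-\infty < x < \infty$, the double exponential distribution. A simple estimate 
of $\lambda_i$ in the ith past sample is the sample mean $\hat\lambda_i$, and the ith sample 
variance, $v_i$, is an estimate of $2\sigma^2$. Taking the final estimate $\hat\sigma^2$ of $\sigma^2$ to 
be $\frac{1}{2}$(pooled sample variance over n past samples), (4.2.1) becomes 

$$\begin{bmatrix} 1 & \hat\Lambda_1 \\ \hat\Lambda_1 & \hat\Lambda_2 \end{bmatrix}\begin{bmatrix} \hat w_0 \\ \hat w_1 \end{bmatrix} = \begin{bmatrix} \hat\Lambda_1 \\ \hat\Lambda_2 - 2\hat\sigma^2/m \end{bmatrix}, \tag{4.2.2}$$

where $\hat\Lambda_1$, $\hat\Lambda_2$ and $\hat\sigma$ are the estimates obtained below. 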

From an actual data set the difference $E_n W$(linear EB) - W(linear 
Bayes) can be estimated by first finding approximate values for the 
variances and covariances of $\hat\Lambda_1$, $\hat\Lambda_2$, $\hat\sigma^2$. These can be obtained by the 
method discussed in sections 3.9 and 3.11. We obtain $\hat\Lambda_1$, $\hat\Lambda_2$, $\hat\sigma^2$ from 
the estimating equations 

$$\sum_{i=1}^{n}(\hat\lambda_i - \Lambda_1) = \sum U_{1i} = 0$$

$$\sum_{i=1}^{n}(\hat\lambda_i^2 - \Lambda_2) = \sum U_{2i} = 0 \tag{4.2.3}$$

$$\sum_{i=1}^{n}(v_i/2 - \sigma^2) = \sum U_{3i} = 0.$$

Application of the previously described methods is now 
straightforward. The covariance matrix of $\hat\Lambda_1$, $\hat\Lambda_2$, $\hat\sigma^2$ is estimated by $\hat C$ whose 
elements are $(1/n^2)\sum U_{pi}U_{qi}$, p, q = 1, 2, 3; note that the method (3.9.4) 
is followed and that A = nI in this case. 

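Since $\hat w_0$, $\hat w_1$ are obtained from the estimating equations (4.2.2), 
which can be written in the same form as (4.2.3), the approximate 
covariance matrix of $\hat w_0$, $\hat w_1$ is estimated by 

$$P^{-1}Q\hat C Q^T(P^{-1})^T$$

where 

$$P = \begin{bmatrix} 1 & \hat\Lambda_1 \\ \hat\Lambda_1 & \hat\Lambda_2 \end{bmatrix}, \qquad Q = \begin{bmatrix} 1 - \hat w_1 & 0 & 0 \\ -\hat w_0 & 1 - \hat w_1 & -2/m \end{bmatrix}.$$

The value of W(linear Bayes) is estimated by 

$$\{\hat w_0 + (\hat w_1 - 1)\hat\Lambda_1\}^2 + (\hat w_1 - 1)^2\{\hat\Lambda_2 - 2\hat\sigma^2/m - \hat\Lambda_1^2\} + \hat w_1^2(2\hat\sigma^2/m)$$

and $E_n W$(linear EB) by 

$$\hat W(\text{linear Bayes}) + \text{var}(\hat w_0) + 2\hat\Lambda_1\,\text{cov}(\hat w_0, \hat w_1) + \hat\Lambda_2\,\text{var}(\hat w_1). \tag{4.2.4}$$

Finally, for the non-Bayes estimate $\hat\lambda$, $W(\hat\lambda)$ is estimated by $2\hat\sigma^2/m$. 

Comments on Example 4.2.2: 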
Comments on Example 4.2.2: 

1. The procedure outlined in the example applies with minor changes 
to many other location-scale distributions. 

2. $E_n W$(linear EB) is given approximately by (4.2.4) with all estimates 
replaced by the corresponding parameter values. 

3. Linear EB quantile estimation, a topic considered more generally 
in section 4.2.4, is straightforward here because we can express 
the p-quantile as $\xi_p = \lambda + y_p\sigma$, and the linear EBE of $\xi_p$ is 
$\delta(\hat\lambda; \hat w_0, \hat w_1) + y_p\hat\sigma$, where $y_p$ is a known constant. 

4.2.3 Estimating both $\lambda$ and $\sigma$: parametric G 

If it is agreed to estimate $\lambda$ and $\sigma$ by the appropriate means of the 
posterior joint distribution of $\Lambda$ and $\Sigma$ no new principle is involved. 
But calculations can be complicated, depending on the form of F. If a 
two-dimensional sufficient statistic for $(\lambda, \sigma)$ does not exist, choosing a 
suitable type of prior G is not straightforward. Moreover, as we have 
seen before, calculation of Bayes and EB estimates tends to be difficult 
in these cases, involving tedious integration. The only relatively 
simple cases seem to be the normal data distribution and the use, for 
whatever reason, of a finite joint distribution of $\Lambda$ and $\Sigma$. 


(a) The normal $N(\lambda, \sigma^2)$ data distribution 

The joint sufficiency of the sample mean and variance for $\lambda$ and $\sigma^2$ and 
the existence of a relatively tractable natural conjugate prior can be 
exploited in this case. For convenience we write $\beta = \sigma^2$, $\bar x$ for the usual 
sample mean and $b = \sum_{i=1}^{m}(x_i - \bar x)^2/(m - 1)$. Then the joint p.d.f. of $\bar x$ 
and b is 

$$f(\bar x, b|\lambda, \beta) = (\text{const.})\,\beta^{-m/2}\,b^{m/2 - 3/2}\exp[-m\{(m-1)b/m + (\bar x - \lambda)^2\}/(2\beta)]. \tag{4.2.5}$$

Following Raiffa and Schlaifer (1961) and Evans (1964), the natural 
conjugate prior distribution of $\Lambda$ and $\beta$ has the joint p.d.f. 

$$g(\lambda, \beta) = (\text{const.})\,\beta^{-(1 + \frac{1}{2}\nu)}\exp[-\{A + Z(\lambda - D)^2\}/(2\beta)], \tag{4.2.6}$$

where $\nu$, A, Z > 0 and $-\infty < D < +\infty$ are constants. 

While we may, in general, be interested in estimation of a variety of 
functions of $\lambda$ and $\beta$, we shall here consider only estimation of $\lambda$, $\beta$ and 
the p-quantile $\lambda + y_p\sqrt{\beta}$. In every case we shall use the quadratic loss 
function, so that the Bayes estimate is the posterior mean of the 
function being estimated. 

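For $\lambda$ we have the Bayes estimator 

$$E(\Lambda|\bar x, b) = \frac{\int\!\!\int \lambda f(\bar x, b|\lambda, \beta)g(\lambda, \beta)\,d\lambda\,d\beta}{\int\!\!\int f(\bar x, b|\lambda, \beta)g(\lambda, \beta)\,d\lambda\,d\beta},$$

and noting that 

$$m(\bar x - \lambda)^2 + Z(\lambda - D)^2 = (m + Z)\left\{\lambda - \frac{m\bar x + ZD}{m + Z}\right\}^2 + \frac{mZ}{(m + Z)}(\bar x - D)^2, \tag{4.2.7}$$

we obtain 

$$E(\Lambda|\bar x, b) = \frac{m\bar x + ZD}{m + Z}. \tag{4.2.8}$$

Again making use of (4.2.7), the Bayes estimator of $\beta$ is 

$$E(\beta|\bar x, b) = \frac{(m-1)b + A + mZ(\bar x - D)^2/(m + Z)}{(m + \nu - 3)},$$

a result given by Evans (1964). By the same technique the Bayes 
estimator of $\sqrt{\beta}$ is 

$$E(\sqrt{\beta}|\bar x, b) = \left\{\frac{(m-1)b + A + mZ(\bar x - D)^2/(m + Z)}{2}\right\}^{1/2}\frac{\Gamma[\frac{1}{2}(m + \nu) - 1]}{\Gamma[\frac{1}{2}(m + \nu) - \frac{1}{2}]}.$$

To check that the estimator of $\sqrt{\beta}$ is of the ‘proper’ form, we note 
that, as m becomes large, 

$$\frac{\Gamma[\frac{1}{2}(m + \nu) - \frac{1}{2}]}{\Gamma[\frac{1}{2}(m + \nu) - 1]} \sim \left(\frac{m + \nu}{2}\right)^{1/2},$$

so that $E(\sqrt{\beta}|\bar x, b) \sim \sqrt{b}$. Finally, the Bayes estimator of $\lambda + y_p\sqrt{\beta}$ is 

$$E(\Lambda|\bar x, b) + y_p E(\sqrt{\beta}|\bar x, b).$$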

Let us now suppose that we have an EB situation, n past estimates 
$(\bar x_i, b_i)$, i = 1, 2, ..., n, being available, generated in the same way as $\bar x$ 
and b. In the notation of section 3.8 we are taking $m_i = m$; the variable 
$m_i$ case is slightly more complicated. They may be regarded as n pairs 
of observations drawn at random from the bivariate population 
whose p.d.f. is 

$$f_G(\bar x, b) = \int\!\!\int f(\bar x, b|\lambda, \beta)g(\lambda, \beta)\,d\lambda\,d\beta = K\,\frac{b^{\frac{1}{2}(m-3)}}{[(m-1)b + A + mZ(\bar x - D)^2/(m + Z)]^{\frac{1}{2}(\nu + m - 1)}}.$$

The method of maximum likelihood can be employed to estimate the 
parameters A, Z, D, $\nu$, but the method of moments is more tractable. 
To find integrals of the type $\int\!\!\int \bar x^r b^s f_G(\bar x, b)\,d\bar x\,db$ it is easier to use the form 
$\mu'_{rs} = \int\!\!\int\!\!\int\!\!\int \bar x^r b^s f(\bar x, b|\lambda, \beta)g(\lambda, \beta)\,d\bar x\,db\,d\lambda\,d\beta$, integrating first w.r.t. $\bar x$ and b. The 
following results are obtained: 

$$\mu'_{10} = D$$
$$\mu'_{01} = A/(\nu - 3)$$
$$\mu'_{20} = \{A/(\nu - 3)\}\{(1/m) + (1/Z)\} + D^2 \tag{4.2.9}$$
$$\mu'_{02} = A^2(m + 1)/\{(m - 1)(\nu - 3)(\nu - 5)\}.$$

It will be noticed that we require $\nu > 5$ for these moments to exist. 
Estimates of the $\mu'_{rs}$ from past data are just the first and second sample 
moments of the $\bar x_i$ and $b_i$. For example, $\mu'_{10}$ is estimated by $\hat\mu'_{10} = 
(\bar x_1 + \bar x_2 + \cdots + \bar x_n)/n$. Equations (4.2.9) can be solved to express A, Z, 
D, $\nu$ in terms of $\mu'_{10}, \ldots, \mu'_{02}$, and replacing $\mu'_{rs}$ by $\hat\mu'_{rs}$, estimates $\hat A$, $\hat Z$, 
$\hat D$, $\hat\nu$ are found. The restriction that $\nu > 5$ must be observed, but since 
$\hat\nu = 5 + 2k\hat\mu'^2_{01}/(\hat\mu'_{02} - k\hat\mu'^2_{01})$, where $k = (m + 1)/(m - 1)$, this poses 
no difficulty unless $\hat\mu'_{02} \le k\hat\mu'^2_{01}$. Since $\nu > 5$ implies that $\mu'_{02}/\mu'^2_{01} > k$, 
decreasing as $\nu$ increases, we adopt the convention of letting $\hat\nu = +\infty$ 
when $\hat\mu'_{02} \le k\hat\mu'^2_{01}$. This means that $\hat A \to +\infty$ as $\hat\nu \to +\infty$ since $\hat\mu'_{01}$ will 
be finite and positive. An estimate of Z is obtained from 

$$(1/\hat Z) + (1/m) = \hat\mu_{20}/\hat\mu'_{01},$$

where $\hat\mu_{20}$ is the sample variance of the $\bar x_i$, 
and we use the convention that $\hat Z = +\infty$ when $(\hat\mu_{20}/\hat\mu'_{01} - 1/m) \le 0$. 
In practical computations it is convenient to replace $\hat Z$ and $\hat\nu$ by large 
positive numbers when the conventions suggest the value $+\infty$. 
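
The moment calculations above can be sketched in Python. This is an illustration under assumptions: equal sample sizes m, the second central moment of the past means in the equation for Z, and a hypothetical constant `big` standing in for $+\infty$ as the text suggests for practical computation. The function names are not from the source.

```python
def solve_moments(mu10, mu01, mu20_central, mu02, m, big=1e8):
    """Solve the moment equations for (A, Z, D, nu) with equal sample sizes m.

    mu10         : mean of the past sample means
    mu01         : mean of the past sample variances b_i
    mu20_central : variance (second central moment) of the past sample means
    mu02         : mean of the squared b_i
    """
    k = (m + 1) / (m - 1)
    gap = mu02 - k * mu01 ** 2
    # Convention from the text: nu = +infinity when mu02 <= k * mu01^2.
    nu = 5.0 + 2.0 * k * mu01 ** 2 / gap if gap > 0 else big
    a = mu01 * (nu - 3.0)
    inv_z = mu20_central / mu01 - 1.0 / m
    # Convention from the text: Z = +infinity when this quantity is <= 0.
    z = 1.0 / inv_z if inv_z > 0 else big
    return a, z, mu10, nu


def moment_estimates(xbars, bs, m, big=1e8):
    """Compute sample moments from past (x-bar_i, b_i) pairs, then solve."""
    n = len(xbars)
    mu10 = sum(xbars) / n
    mu01 = sum(bs) / n
    mu20_central = sum((x - mu10) ** 2 for x in xbars) / n
    mu02 = sum(b * b for b in bs) / n
    return solve_moments(mu10, mu01, mu20_central, mu02, m, big)


# Moments implied by A = 14, Z = 2, D = 3, nu = 10 with m = 5.
a_hat, z_hat, d_hat, nu_hat = solve_moments(3.0, 2.0, 1.4, 8.4, m=5)
```

Feeding in the theoretical moments for known parameter values, as in the last line, is a simple way to check that the solver inverts the moment equations.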

The EB estimators are, as usual, obtained on replacing A, Z, D, $\nu$, by 
their estimates in the expressions for the Bayes estimators; for 
example, denoting the EB estimators by 

$$\hat E(\Lambda|\bar x, b), \quad \hat E(\beta|\bar x, b), \quad \hat E(\sqrt{\beta}|\bar x, b),$$

we have 

$$\hat E(\Lambda|\bar x, b) = (m\bar x + \hat Z\hat D)/(m + \hat Z), \text{ etc.}$$

Since 

$$\hat A/(\hat\nu - 3) = \hat\mu'_{01},$$

the effect of letting $\hat\nu \to +\infty$ while $[\hat A/(\hat\nu - 3)]$ remains finite, is that 

$$\hat E(\beta|\bar x, b) \to \hat\mu'_{01}$$

and 

$$\hat E(\sqrt{\beta}|\bar x, b) \to (\hat\mu'_{01})^{1/2}.$$

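When we do not have all $m_i = m$ the estimating equations for A, D, 
Z, $\nu$, derived from (4.2.9) become 

$$\hat\mu'_{01} = A/(\nu - 3)$$

$$\hat\mu'_{20} = \{A/(\nu - 3)\}\left\{\frac{1}{n}\sum_{i=1}^{n} 1/m_i + 1/Z\right\} + D^2 \tag{4.2.10}$$

$$\hat\mu'_{02} = A^2\left\{\frac{1}{n}\sum_{i=1}^{n}(m_i + 1)/(m_i - 1)\right\}\Big/\{(\nu - 3)(\nu - 5)\}.$$

The sample values $\hat\mu'_{10}$, etc. are defined as before. 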

Performance of the EBEs. Consider first the EBE of $\lambda$. It can be 
expressed as 

$$\hat E(\Lambda|\bar x, b) = \hat\omega_0 + \hat\omega_1\bar x$$

with $\hat\omega_0 = \hat Z\hat D/(m + \hat Z)$, $\hat\omega_1 = m/(m + \hat Z)$. For this estimate of $\lambda$ we 
have 

$$W(\hat\omega_0 + \hat\omega_1\bar x) = \{(\hat\omega_1 - 1)D + \hat\omega_0\}^2 + \{(\hat\omega_1 - 1)^2/Z + \hat\omega_1^2/m\}A/(\nu - 3) \tag{4.2.11}$$

$$W(\omega_0 + \omega_1\bar x) = A/\{(m + Z)(\nu - 3)\}$$

$$W(\bar x) = A/\{m(\nu - 3)\}.$$

From (4.2.11), assuming $\hat\omega_0$, $\hat\omega_1$ to be approximately unbiased for $\omega_0$, 
$\omega_1$, 

$$E_n W(\text{EB}) \simeq D^2\,\text{var}(\hat\omega_1) + \text{var}(\hat\omega_0) + 2D\,\text{cov}(\hat\omega_1, \hat\omega_0) + \text{var}(\hat\omega_1)(m + Z)A/\{(\nu - 3)mZ\} + W(\text{Bayes}).$$

Given a particular data set the covariance matrix of $\hat\omega_0$, $\hat\omega_1$ can be 
estimated by the technique discussed in Example 4.2.2, using the 
estimating equations (4.2.10). An approximate expression for 
$E_n W$(EB) in terms of the parameters A, Z, D, $\nu$ can be obtained on 
replacing $\hat C$ by the approximate matrix of theoretical moments. 

We can write the EBE of $\beta$ as 

$$\hat E(\beta|\bar x, b) = \hat h_0 + \hat h_1 b + \hat h_2(\bar x - \hat D)^2$$

from which expressions for the relevant expected losses W(·) can be 
derived. They are somewhat more complicated than the 
corresponding results for $\lambda$. Expressions for the relevant quantities when 
estimating $\sqrt{\beta}$ are even more difficult. Estimation of quantiles 
therefore presents a rather difficult problem, even in the apparently 
simple case of the normal data distribution. We defer further 
discussion to section 4.3 where an alternative approach to estimation 
of quantiles is considered. Finally we note that estimation of the 
various parameters could be performed by the ML method instead of 
the method of moments as was done above. 


(b) Finite G 

By finite G is meant a distribution with a finite number of discrete 
mass points. The marginal distributions of both $\Lambda$ and $\beta$ are, therefore, 
finite step-functions. As an example we shall consider a distribution 
with mass points of equal weight 1/6 at $(\beta_1, D \pm \delta_1)$, $(\beta_2, D \pm \delta_2)$, 
$(\beta_3, D \pm \delta_3)$ with $0 < \beta_1 < \beta_2 < \beta_3$. This 
distribution has some of the features of the distribution (4.2.6); and many 
other configurations are possible. 

For given $\beta$, D, $\delta$ calculation of the Bayes estimate of $\lambda$ is 
straightforward according to 

$$E(\Lambda|x) = \frac{\sum_{i=1}^{3}\{(D - \delta_i)L_i^- + (D + \delta_i)L_i^+\}}{\sum_{i=1}^{3}\{L_i^- + L_i^+\}}, \qquad L_i^{\pm} = \prod_{j=1}^{m}\frac{1}{\sqrt{\beta_i}}\,\phi\!\left(\frac{x_j - D \mp \delta_i}{\sqrt{\beta_i}}\right),$$

where $\phi$ is the standard normal density, 
but evaluation of W(Bayes), etc., is very tedious. 

Estimation of $\beta$, D, $\delta$ under the EB sampling scheme is easiest by the 
method of moments; the details are similar to those for linear EB 
estimation, as given in section 4.2.4. Alternatively, one may fix the 
above mesh points and assign probability masses $\theta_i$ to them. The $\theta_i$ 
can then be estimated by application of the EM algorithm. 
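
A minimal Python sketch of the EM step for the masses at fixed mesh points is given below. It assumes normal data within each past sample and a mesh of (location, variance) pairs supplied by the user; the function names and the data are hypothetical, and the fixed iteration count is a simplification rather than a convergence rule.

```python
import math


def normal_loglik(sample, lam, beta):
    """Log-likelihood of one past sample under N(lam, beta)."""
    m = len(sample)
    ss = sum((x - lam) ** 2 for x in sample)
    return -0.5 * m * math.log(2.0 * math.pi * beta) - 0.5 * ss / beta


def em_masses(samples, points, iters=200):
    """Estimate the masses theta at fixed mesh points [(lam, beta), ...]."""
    k = len(points)
    # Likelihood of every past sample at every mesh point; these stay fixed.
    lik = []
    for sample in samples:
        logs = [normal_loglik(sample, lam, beta) for lam, beta in points]
        top = max(logs)
        lik.append([math.exp(v - top) for v in logs])  # rescaled per sample
    theta = [1.0 / k] * k
    for _ in range(iters):
        totals = [0.0] * k
        for row in lik:
            post = [t * l for t, l in zip(theta, row)]
            norm = sum(post)
            for j in range(k):
                totals[j] += post[j] / norm      # E-step: posterior weights
        theta = [t / len(lik) for t in totals]   # M-step: average the weights
    return theta


# Hypothetical past samples and a two-point mesh, for illustration only.
samples = [[0.1, -0.2, 0.3], [4.9, 5.2, 5.1], [0.0, 0.4, -0.1]]
points = [(0.0, 1.0), (5.0, 1.0)]
theta_hat = em_masses(samples, points)
```

Only the masses are updated; the mesh points themselves are held fixed, which is what keeps each iteration to a simple averaging of posterior weights.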


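4.2.4 Estimating both $\lambda$ and $\sigma$: linear Bayes and EB 

There are several ways in which the idea of linear Bayes estimation 
can be applied in the present context. The simplest seems to be to 
begin with conventional estimates $\hat\lambda$ and $\hat\sigma$ of $\lambda$ and $\sigma$; typically they 
may be linear functions of order statistics, unbiased with minimum 
variance. Now consider estimation of a linear function $A_\lambda\lambda + A_\sigma\sigma$, $A_\lambda$ 
and $A_\sigma$ being given constants. By choosing $A_\lambda$, $A_\sigma$ suitably this 
includes estimation of $\lambda$, $\sigma$ or a p-quantile. 

Suppose that we estimate $A_\lambda\lambda + A_\sigma\sigma$ by 

$$\delta(\hat\lambda, \hat\sigma; \omega) = \omega_0 + \omega_\lambda\hat\lambda + \omega_\sigma\hat\sigma$$

where $\omega_0$, $\omega_\lambda$, $\omega_\sigma$ are chosen to minimize 

$$\int\!\!\int (\omega_0 + \omega_\lambda\hat\lambda + \omega_\sigma\hat\sigma - A_\lambda\lambda - A_\sigma\sigma)^2\,dF(\hat\lambda, \hat\sigma|\lambda, \sigma)\,dG(\lambda, \sigma).$$

Let $\gamma_{rs} = \int \lambda^r\sigma^s\,dG(\lambda, \sigma)$, var($\hat\lambda|\lambda, \sigma$) = $\sigma^2V_{11}(m)$, var($\hat\sigma|\lambda, \sigma$) = 
$\sigma^2V_{22}(m)$, cov($\hat\lambda, \hat\sigma|\lambda, \sigma$) = $\sigma^2V_{12}(m)$, $E_G(\hat\lambda^r\hat\sigma^s) = \int\!\!\int \hat\lambda^r\hat\sigma^s\,dF(\hat\lambda, \hat\sigma|\lambda, \sigma)\,dG(\lambda, \sigma)$. 
Then the optimum, i.e. linear Bayes, $\omega$ is given by 

$$\begin{bmatrix} 1 & E_G(\hat\lambda) & E_G(\hat\sigma) \\ E_G(\hat\lambda) & E_G(\hat\lambda^2) & E_G(\hat\lambda\hat\sigma) \\ E_G(\hat\sigma) & E_G(\hat\lambda\hat\sigma) & E_G(\hat\sigma^2) \end{bmatrix}\begin{bmatrix} \omega_0 \\ \omega_\lambda \\ \omega_\sigma \end{bmatrix} = \begin{bmatrix} A_\lambda\gamma_{10} + A_\sigma\gamma_{01} \\ A_\lambda\gamma_{20} + A_\sigma\gamma_{11} \\ A_\lambda\gamma_{11} + A_\sigma\gamma_{02} \end{bmatrix} \tag{4.2.12}$$

which we can write as 

$$M_G\,\omega = \Gamma A.$$

The expected loss, W(linear Bayes), is 

$$A^T\{\Gamma_1 - \Gamma^T M_G^{-1}\Gamma\}A$$

where 

$$\Gamma_1 = \begin{pmatrix} \gamma_{20} & \gamma_{11} \\ \gamma_{11} & \gamma_{02} \end{pmatrix}.$$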

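Under the EB sampling scheme estimation of the elements of $M_G$ 
and $\Gamma$ can proceed as follows, estimates being written $\hat E_G(\hat\lambda)$, $\hat\gamma_{10}$, etc., 
and note that 

$$E_G(\hat\lambda^2) = \gamma_{20} + V_{11}(m)\gamma_{02}, \quad E_G(\hat\lambda\hat\sigma) = \gamma_{11} + V_{12}(m)\gamma_{02}, \quad E_G(\hat\sigma^2) = \gamma_{02}\{1 + V_{22}(m)\}.$$

Let $\hat\lambda_i$, $\hat\sigma_i$ be the conventional unbiased estimates of $\lambda$, $\sigma$ at the ith past 
realization of $\Lambda$, $\Sigma$, based on $m_i$ observations. Then 

$$\hat E_G(\hat\lambda) = \hat\gamma_{10} = (1/n)\sum_{i=1}^{n}\hat\lambda_i$$

$$\hat E_G(\hat\sigma) = \hat\gamma_{01} = (1/n)\sum_{i=1}^{n}\hat\sigma_i$$

$$n\hat\gamma_{20} + \hat\gamma_{02}\sum_{i=1}^{n}V_{11}(m_i) = \sum_{i=1}^{n}\hat\lambda_i^2 \tag{4.2.13}$$

$$n\hat\gamma_{11} + \hat\gamma_{02}\sum_{i=1}^{n}V_{12}(m_i) = \sum_{i=1}^{n}\hat\lambda_i\hat\sigma_i$$

$$\left\{n + \sum_{i=1}^{n}V_{22}(m_i)\right\}\hat\gamma_{02} = \sum_{i=1}^{n}\hat\sigma_i^2.$$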


Estimation of $E_n W$(linear EB), etc., can be carried out by 
straightforward modification of the procedures described for Example 4.2.2. 

An alternative approach to linear Bayes and EB estimation in the 
location-scale family of data distributions is not to reduce the m 
independent observations to the estimates $\hat\lambda$ and $\hat\sigma$, but to base the 
estimates directly on order statistics. Thus, estimate $A_\lambda\lambda + A_\sigma\sigma$ by 
$\delta(x; \omega) = \omega_0(m) + \sum_{j=1}^{m}\omega_j(m)x_{(j)}$, choosing $\omega$ so as to minimize 

$$\int\!\!\int \{\delta(x; \omega) - A_\lambda\lambda - A_\sigma\sigma\}^2\,dF(x|\lambda, \sigma)\,dG(\lambda, \sigma).$$

The details are similar to those given above and are shown more fully 
in Lwin (1976). 


4.3 Quantile estimation 

The p-quantile of the continuous distribution F(x) is $\xi_p$ and it is 
defined by the relation $F(\xi_p) = p$. If F is of the location-scale type as 
discussed in section 4.2 we can express $\xi_p$ as a linear function of $\lambda$ and 
$\sigma$, i.e. $\xi_p = \lambda + y_p\sigma$. Thus, for location-scale type F this problem can 
be seen as a particular application of the ideas developed in 
section 4.2. However, the question of quantile estimation need not be 
addressed in only this restricted setting. It does arise quite naturally in 
the nonparametric context where the exact form of F may be 
unspecified. Since the linear Bayes and EB methods are less 
demanding about the form of underlying distribution than the true Bayes 
method, this approach seems well suited to the problem of quantile 
estimation. 


4.3.1 Linear Bayes and EB estimation 


Let G be the joint prior distribution function of the relevant 
parameters. Suppose that $\hat\xi_p$ is a conventional estimate of $\xi_p$ based on 
m independent realizations of X in the current component problem, 
and that its variance is var($\hat\xi_p|\xi_p$) = $v_p(m)$. We shall assume that the 
bias of $\hat\xi_p$ is negligible, and that to a satisfactory degree of 
approximation $v_p = \omega_p/m$. This latter assumption is not strictly 
needed but it is a useful simplification. In certain special cases more 
accurate statements about the first two moments of the conventional 
estimate can be made; an example is that of X having a normal 
distribution. In the notation just introduced G is the joint prior 
distribution of $\Xi_p$ and $\Omega_p$. It will also be useful to write $E_G$ for 
expectation w.r.t. G, $E_{X|G}$ for expectation w.r.t. the observations 
conditional on fixed values of the parameters, and so on. Variances 
may be similarly subscripted. Where there is no confusion subscripts 
may be omitted. Thus var($\hat\xi_p|\xi_p$) = $\omega_p/m$ = var$_{X|G}(\hat\xi_p|\xi_p)$. 

In the manner of section 4.2.2 we define the linear Bayes estimate of 
$\xi_p$ as $\delta_G(\beta_0, \beta_1; \hat\xi_p) = \beta_0 + \beta_1\hat\xi_p$, with $\beta_0$ and $\beta_1$ determined so as to 
minimize the expected squared error $E_G E_{X|G}(\beta_0 + \beta_1\hat\xi_p - \Xi_p)^2$. The 
appropriate values of $\beta_0$ and $\beta_1$ are given as the solutions of 

$$\begin{pmatrix} 1 & E_G(\Xi_p) \\ E_G(\Xi_p) & E_G(\Xi_p^2) + E_G(\Omega_p)/m \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} = \begin{pmatrix} E_G(\Xi_p) \\ E_G(\Xi_p^2) \end{pmatrix}. \tag{4.3.1}$$

Solving this equation we find that the linear Bayes estimate can be 
expressed as 

$$\delta_G(\beta_0, \beta_1; \hat\xi_p) = (1 - \beta_1)E_G(\Xi_p) + \beta_1\hat\xi_p$$

where 

$$\beta_1 = \text{var}_G(\Xi_p)/\{\text{var}_G(\Xi_p) + E_G(\Omega_p)/m\}.$$

Also, 

$$W(\text{linear Bayes}) = 1/[1/\{\text{var}_G(\Xi_p)\} + 1/\{E_G(\Omega_p)/m\}],$$
$$W(\hat\xi_p) = E_G(\Omega_p)/m.$$

For the empirical version of the linear Bayes estimate we need 
estimates of the elements of the matrix on the left side of (4.3.1) and of 
the vector on the right side. We shall assume that in each component 
problem we can calculate a conventional estimate $\hat\xi_{pi}$ of the 
component $\xi_{pi}$, and also a conventional estimate $\hat\omega_{pi}$ of the 
component $\omega_{pi}$. We shall also assume that these estimates are unbiased or 
that their biases are negligible. Then estimates of $E_G(\Xi_p)$, $E_G(\Xi_p^2)$ and 
$E_G(\Omega_p)$ are given by 

$$\hat E_G(\Xi_p) = (1/n)\sum_{i=1}^{n}\hat\xi_{pi}$$

$$\hat E_G(\Xi_p^2) = (1/n)\sum_{i=1}^{n}(\hat\xi_{pi}^2 - \hat\omega_{pi}/m_i) \tag{4.3.2}$$

$$\hat E_G(\Omega_p) = (1/n)\sum_{i=1}^{n}\hat\omega_{pi}.$$

Replacing the expectations in (4.3.1) by their estimates (4.3.2) gives 
estimates $\hat\beta_0$, $\hat\beta_1$ of the $\beta$'s, and hence the linear empirical Bayes estimate 
whose expected loss is 

$$W(\text{linear EB}) = \hat\beta_0^2 + (\hat\beta_1 - 1)^2E_G(\Xi_p^2) + 2\hat\beta_0(\hat\beta_1 - 1)E_G(\Xi_p) + \hat\beta_1^2E_G(\Omega_p/m). \tag{4.3.3}$$

Using formula (4.3.3) it is relatively straightforward to evaluate 
$E_n W$(linear EB); see also Maritz (1989). 
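
The moment estimates, the linear EB coefficients and the expected loss formula of this section can be sketched in Python as follows. The function names are hypothetical, the past sample sizes are allowed to differ (an assumption about how the variance correction is applied), and the prior variance estimate is truncated at zero.

```python
def moment_estimates(xi_hats, omega_hats, sizes):
    """Estimates of E_G(Xi_p), E_G(Xi_p^2), E_G(Omega_p) from past components."""
    n = len(xi_hats)
    e_xi = sum(xi_hats) / n
    e_xi2 = sum(x * x - w / mi
                for x, w, mi in zip(xi_hats, omega_hats, sizes)) / n
    e_omega = sum(omega_hats) / n
    return e_xi, e_xi2, e_omega


def linear_bayes_coefficients(e_xi, e_xi2, e_omega, m):
    """Solve the 2 x 2 system for (beta_0, beta_1), current sample size m."""
    var_xi = max(0.0, e_xi2 - e_xi ** 2)   # truncated at zero (assumed)
    beta1 = var_xi / (var_xi + e_omega / m)
    return (1.0 - beta1) * e_xi, beta1


def expected_loss(beta0, beta1, e_xi, e_xi2, e_omega, m):
    """Expected loss of beta0 + beta1 * xi_hat for given prior moments."""
    return (beta0 ** 2 + (beta1 - 1.0) ** 2 * e_xi2
            + 2.0 * beta0 * (beta1 - 1.0) * e_xi
            + beta1 ** 2 * e_omega / m)


# With the optimal coefficients the loss reduces to W(linear Bayes).
e_xi, e_xi2, e_omega, m = 2.0, 5.0, 3.0, 6
b0, b1 = linear_bayes_coefficients(e_xi, e_xi2, e_omega, m)
loss_at_optimum = expected_loss(b0, b1, e_xi, e_xi2, e_omega, m)
w_linear_bayes = 1.0 / (1.0 / (e_xi2 - e_xi ** 2) + 1.0 / (e_omega / m))
```

Evaluating `expected_loss` at estimated coefficients while holding the prior moments fixed is exactly the calculation needed when averaging over past samples to approximate $E_n W$(linear EB).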

4.3.2 Location-scale distributions of known form 

Suppose that the distribution function of X is $F\{(x - \lambda)/\sigma\}$ where the 
form of F is given. Typically $\lambda$, $\sigma$ will be estimated by the ML method, 
as will be $\hat\xi_p = \hat\lambda + y_p\hat\sigma$. The constant $y_p$ will be known and we also 
have $\omega_p = k_p\sigma^2$ where $k_p$ is another known constant. For example, if 
F is $N(\lambda, \sigma^2)$ and p = 0.75 we have $y_p = 0.67449$ and $k_p = 1.23$. In 
this example we then have $\hat\xi_{pi} = \bar x_i + 0.67449s_i$ and $\hat\omega_{pi} = 1.23s_i^2$. 
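
The two constants quoted for the normal case can be recovered numerically. The sketch below assumes that $k_p$ is the large-sample variance factor of $\bar x + y_p s$, namely $1 + y_p^2/2$; that formula is an assumption used here because it reproduces the quoted value, and the helper `component_estimates` is hypothetical.

```python
from statistics import NormalDist, mean, stdev

p = 0.75
y_p = NormalDist().inv_cdf(p)     # standard normal p-quantile
k_p = 1.0 + y_p ** 2 / 2.0        # assumed large-sample variance factor


def component_estimates(sample):
    """Conventional estimates (xi_hat, omega_hat) for one normal sample."""
    s = stdev(sample)             # sample s.d. with divisor m - 1
    return mean(sample) + y_p * s, k_p * s * s


xi_hat, omega_hat = component_estimates([9.8, 10.4, 11.1, 9.5, 10.2])
```

For other values of p only the first line changes; for non-normal F both constants would have to be derived from the form of F.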

4.3.3 Distributions of unknown form 

When the form of the distribution of X is not specified the natural 
distribution-free estimate of $\xi_p$ is the sample p-quantile. In this case 
var($\hat\xi_p|\xi_p$) = $p(1 - p)/\{mf^2(\xi_p)\}$, approximately, where $f(\xi_p)$ is 
the density of the distribution of X at $x = \xi_p$. Thus, in this context, 
$\omega_p = p(1 - p)/f^2(\xi_p)$. 

We can now apply the theory of section 4.3.1, as was done in 
section 4.3.2, the only change of note being that the method of 
estimating var($\hat\xi_p|\xi_p$) is different. Various possibilities are open here, 
depending on the sample sizes $m_i$. With moderately large sample sizes 
one may begin with a kernel density estimate $\hat f_p$ of $f(\xi_p)$ and then 
estimate $\omega_p$ in the ith component by $\hat\omega_{pi} = p(1 - p)/\hat f_{pi}^2$. 

An important aspect of these calculations is that no specific 
assumptions are made about the form of the underlying F. The only 
parameters of consequence are $E_G(\Xi_p)$, $E_G(\Xi_p^2)$, and $E_G(\Omega_p)$. 
Therefore, in postulating a sequence of realizations $\{\xi_{pi}, \omega_{pi}\}$, i = 1, 2, ..., n 
it is not necessary to suppose that they are generated by the same form 
of distribution F with varying parameters. The form of F itself could 
vary. 

To conclude this section we consider estimation of IV(linear Bayes) 
and E„ W (linear EB) from a given set of data. A bootstrap approach 
seems natural in the given conditions. We begin with the estimated 
values of the expectations appearing in relations (4.3.2); they are 
E g (E p ), etc., and note that var G (S p ) = max[0, E a (E 2 p ) - {£ G (S P )} 2 ]. 
The estimated W (linear Bayes) is 

IT(linear Bayes) = l/[l/{var G (S p )} + l/{£ G (Q p )/m}]. 

In the following bootstrap calculations the estimated moments and 
P 0 , fii are treated as if they are the true values. The steps are: 

1. Select one of the component data sets at random. 

2. Select an $m'$ value at random using the empirical distribution of
component sample sizes.

3. Generate $m'$ random observations from the empirical distribution
function of the data in the set selected in step 1.

4. Do steps 1-3 n times.

5. Calculate $(\hat\xi'_{pi}, \hat\omega'_{pi})$, $i = 1, 2, \ldots, n$, and $\hat\beta'_0$, $\hat\beta'_1$ using the data
generated in steps 1-4.

6. Calculate $W(\hat\beta'_0 + \hat\beta'_1 \hat\xi'_p)$ by formula (4.3.3) treating $\hat\beta_0$, $\hat\beta_1$, etc., as
the true values.

7. Do steps 1-6 a number, N, of times and find the mean of the W
values obtained by step 6. It is an estimate of $E_n W$(linear EB).
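
The resampling part of this scheme (steps 1-4, and the averaging in step 7) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function names `bootstrap_components` and `bootstrap_mean_W` are hypothetical, each component data set is taken to be a one-dimensional array, and `w_fn` is a user-supplied placeholder for steps 5 and 6, which depend on formula (4.3.3) and are not reproduced here.

```python
import numpy as np

def bootstrap_components(datasets, rng):
    """Steps 1-4: build n bootstrap component samples from n observed ones."""
    n = len(datasets)
    sizes = [len(d) for d in datasets]          # empirical distribution of sample sizes
    boot = []
    for _ in range(n):                          # step 4: repeat n times
        source = datasets[rng.integers(n)]      # step 1: pick a component at random
        m_new = int(rng.choice(sizes))          # step 2: pick a sample size m'
        # step 3: m' draws from the empirical distribution of the chosen component
        boot.append(rng.choice(source, size=m_new, replace=True))
    return boot

def bootstrap_mean_W(datasets, w_fn, N, seed=0):
    """Step 7: average w_fn (a stand-in for steps 5 and 6) over N replicates."""
    rng = np.random.default_rng(seed)
    values = [w_fn(bootstrap_components(datasets, rng)) for _ in range(N)]
    return float(np.mean(values))
```

Keeping steps 5 and 6 behind a single callable means the resampling logic does not need to change if a different quantile estimate or risk formula is used.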

4.4 The multivariate normal distribution 

Study of the multivariate normal can be defended on the grounds that
essentially multivariate data often are collected and analysed, and the
EB sampling scheme may well apply to such collections. Here we
would have to assume that both the mean vector and the covariance
matrix, in general, have non-degenerate prior distributions. Estimation
of multivariate normal means also arises quite naturally in
applications of general linear models. An example of this sort is
discussed in detail by Hui and Berger (1983); it is also discussed in
Chapter 8. In this type of example it is sometimes not unrealistic to
assume that the covariance matrix changes from component to
component but is known, apart possibly from a multiplicative
constant. This is analogous to the univariate normal case with
variable numbers $m_i$ of observations at the component problems, but
with the variance remaining constant from component to component.

The notation we shall use is: the data distribution is $N(\lambda, \Sigma)$ and the
prior distribution of $(\lambda, \Sigma)$ is $G(\lambda, \Sigma)$ where the matrices are $k \times k$ and
the vectors are $k \times 1$.

4.4.1 Known $\Sigma$, $\Lambda$ distributed $N(\mu_G, \Sigma_G)$

Standard theory of the multivariate normal distribution gives the
result that the posterior distribution of $\Lambda \mid x$ is multivariate normal
with mean vector

$$E(\Lambda \mid x) = (\Sigma^{-1} + \Sigma_G^{-1})^{-1}(\Sigma^{-1}x + \Sigma_G^{-1}\mu_G) \tag{4.4.1}$$

and covariance matrix $M = (\Sigma^{-1} + \Sigma_G^{-1})^{-1}$. The marginal x-distribution
is $N(\mu_G, \Sigma + \Sigma_G)$.

Recall from section 1.6 that we may choose to take the loss when
estimating $\lambda$ by $\delta$ as $L(\delta, \lambda) = (\delta - \lambda)^T A (\delta - \lambda)$, this being a natural
generalization of squared error loss for single parameter estimation.
The Bayes estimate, i.e. that $\delta$ which minimizes the expected loss, does
not depend on A, but the actual expected loss does. In order to
simplify the following exposition A will be set equal to the unit matrix
so that $L(\delta, \lambda)$ is just a sum of squared errors.

With A = I the expected squared error of the Bayes estimate is

$$W(\text{Bayes}) = \mathrm{tr}\{M\Sigma^{-1}M^T\} + \mathrm{tr}\{M\Sigma_G^{-1}M^T\} = \mathrm{tr}(M^T). \tag{4.4.2}$$

In the EB setting the past observations are $x_i$, $i = 1, 2, \ldots, n$ and we
shall assume that the operative k-variate normal data distribution at
the ith component has mean $\lambda_i$ and known covariance matrix $\Sigma_i$. The
subscript i to $\Sigma$ indicates that we are considering component
problems that are not necessarily identical. Then we can estimate $\mu_G$
by $\bar x = (1/n)\sum_{i=1}^n x_i$. To estimate $\Sigma_G$, let S be the matrix of second
moments about the origin calculated from the past $x_i$ vectors, thus

$$S = (1/n)\sum_{i=1}^n x_i x_i^T.$$

Now, since $E(X_i X_i^T \mid \lambda_i) = \lambda_i\lambda_i^T + \Sigma_i$, an estimate $\hat\Sigma_G$ of $\Sigma_G$ is given by

$$S = \bar x \bar x^T + \hat\Sigma_G + (1/n)\sum_{i=1}^n \Sigma_i. \tag{4.4.3}$$

This estimation is by the method of moments. ML estimation of $\mu_G$
and $\Sigma_G$ is possible but more complicated; see the analogous univariate
normal case in section 3.9.3.
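
The moment estimates defined through (4.4.3), and the resulting plug-in version of (4.4.1), can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the function names `mom_prior_estimates` and `eb_estimate` are hypothetical, and the sketch assumes that the estimate of $\Sigma_G$ turns out to be positive definite, which the subtraction in (4.4.3) does not guarantee in small samples.

```python
import numpy as np

def mom_prior_estimates(x, sigmas):
    """Method-of-moments estimates of mu_G and Sigma_G, following (4.4.3).

    x      : (n, k) array of past observation vectors x_i
    sigmas : (n, k, k) array of the known covariance matrices Sigma_i
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    x_bar = x.mean(axis=0)
    S = x.T @ x / n                                   # second moments about the origin
    sigma_G_hat = S - np.outer(x_bar, x_bar) - np.mean(sigmas, axis=0)
    return x_bar, sigma_G_hat

def eb_estimate(x_new, sigma, mu_G_hat, sigma_G_hat):
    """Plug the estimates into the posterior mean (4.4.1)."""
    M_hat = np.linalg.inv(np.linalg.inv(sigma) + np.linalg.inv(sigma_G_hat))
    return M_hat @ (np.linalg.solve(sigma, x_new)
                    + np.linalg.solve(sigma_G_hat, mu_G_hat))
```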

Example 4.4.1 In longitudinal health studies it is common to
measure a response variable Y at times $t_{1i} < t_{2i} < \cdots < t_{m_i i}$, say, on
subject i, obtaining results $y_{1i}, y_{2i}, \ldots, y_{m_i i}$. Although analyses are
often simpler if the t-values are the same for each subject, arranging
such a data set in practice is usually not possible. Let us suppose that
the data set for the ith subject is summarized in the intercept $a_i$ and
slope $b_i$ of a Y on t regression line fitted to the data by the method of
least squares.

Under the usual assumptions $a_i$ and $b_i$ are unbiased estimates of $\alpha_i$
and $\beta_i$, the parameter values characterizing subject i, and the joint
distribution of $a_i, b_i$ is bivariate normal with mean $(\alpha_i, \beta_i)^T$ and
covariance matrix

$$\sigma_i^2 \begin{pmatrix} \sum_j t_{ji}^2/(m_i S_i) & -\bar t_i/S_i \\ -\bar t_i/S_i & 1/S_i \end{pmatrix},$$

where $\bar t_i = (1/m_i)\sum_j t_{ji}$ and $S_i = \sum_j (t_{ji} - \bar t_i)^2$. Often it is
reasonable to take the residual variance $\sigma_i^2$ as fixed at $\sigma^2$ for every
subject. We assume this to be so for our present illustration.

In this example $(a_i, b_i)$ takes the place of $(x_{1i}, x_{2i})$ in the preceding
theory. The covariance matrix of $(a_i, b_i)$ for given $(\alpha_i, \beta_i)$ is known
except for $\sigma^2$ which can be estimated by a pooled residual variance.
So, $\Sigma_i$ is estimated by substituting the pooled estimate $\hat\sigma^2$ for $\sigma^2$, and

$$\bar x = (\bar a, \bar b)^T$$

is an estimate of $\mu_G$.


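The empirical Bayes estimate given by $\hat\mu_G = \bar x$, and $\hat\Sigma_G$ obtained
from (4.4.3), is

$$\hat E(\Lambda \mid x) = (\Sigma^{-1} + \hat\Sigma_G^{-1})^{-1}(\Sigma^{-1}x + \hat\Sigma_G^{-1}\hat\mu_G) = \hat M(\Sigma^{-1}x + \hat\Sigma_G^{-1}\hat\mu_G)$$

and its expected loss, W(EB), is

$$(\hat M\Sigma^{-1}\mu_G + \hat M\hat\Sigma_G^{-1}\hat\mu_G - \mu_G)^T(\hat M\Sigma^{-1}\mu_G + \hat M\hat\Sigma_G^{-1}\hat\mu_G - \mu_G)$$
$$+ \mathrm{tr}\{(\hat M\Sigma^{-1} - I)\Sigma_G(\hat M\Sigma^{-1} - I)^T\} + \mathrm{tr}(\hat M\Sigma^{-1}\hat M).$$

The difference W(EB) $-$ W(Bayes) can be expressed as

$$\{\hat M\hat\Sigma_G^{-1}(\hat\mu_G - \mu_G)\}^T\{\hat M\hat\Sigma_G^{-1}(\hat\mu_G - \mu_G)\}$$
$$+ \mathrm{tr}\{(\hat M - M)\Sigma^{-1}(\Sigma + \Sigma_G)\Sigma^{-1}(\hat M - M)\}.$$

The expression for $E_n W$(EB) $-$ W(Bayes) simplifies slightly through
independence of $\hat M$, $\hat\Sigma_G$ and $\hat\mu_G$ to

$$\frac{1}{n}E_n \mathrm{tr}\{(I - \hat M\Sigma^{-1})(\Sigma + \Sigma_G)(I - \hat M\Sigma^{-1})^T\}$$
$$+ E_n \mathrm{tr}\{(\hat M - M)\Sigma^{-1}(\Sigma + \Sigma_G)\Sigma^{-1}(\hat M - M)\}. \tag{4.4.4}$$

Asymptotic optimality of the EBE is seen to follow quite readily
from (4.4.4), but despite the distribution of the elements of $\hat M$ being
known, evaluation of the expression is not straightforward.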


4.4.2 Linear Bayes and EB

(a) Known $\Sigma$

We estimate $\lambda$ by

$$\delta(x; a, B) = a + Bx \tag{4.4.5}$$

where a is a $k \times 1$ vector and B is a $k \times k$ matrix, a and B chosen so as to
minimize the expected loss

$$W(a + Bx) = \int\!\!\int (a + Bx - \lambda)^T(a + Bx - \lambda)\, dF(x \mid \lambda, \Sigma)\, dG(\lambda)$$
$$= \{a + (B - I)\mu_G\}^T\{a + (B - I)\mu_G\} + \mathrm{tr}(B\Sigma_G B^T) + \mathrm{tr}(B\Sigma B^T) \tag{4.4.6}$$

where F is the multivariate normal distribution with mean vector $\lambda$
and known, fixed covariance matrix $\Sigma$.

Differentiating w.r.t. the elements of a and B the following
equations are obtained for a, B:

$$a + B\mu_G = \mu_G$$
$$a\mu_G^T + B(\mu_G\mu_G^T + \Sigma_G + \Sigma) = \mu_G\mu_G^T + \Sigma_G \tag{4.4.7}$$

in obvious analogy with earlier univariate results. From (4.4.7) we
obtain, after multiplying the first equation by $\mu_G^T$,

$$B = \Sigma_G(\Sigma_G + \Sigma)^{-1} = (\Sigma_G^{-1} + \Sigma^{-1})^{-1}\Sigma^{-1}$$
$$a = \Sigma(\Sigma_G + \Sigma)^{-1}\mu_G = (\Sigma_G^{-1} + \Sigma^{-1})^{-1}\Sigma_G^{-1}\mu_G. \tag{4.4.8}$$

The results (4.4.8) agree with (4.4.1).

In order to construct an empirical version of the linear Bayes
estimate we need estimates of $\Sigma_G$ and $\mu_G$, and they can be obtained by
the method of moments. First we note the expectation of the
marginal $X_j$ is $\mu_{Gj}$. Therefore we can estimate $\mu_G$ by $\bar x$ as before. Then,
for the marginal variates $X_r$, $X_s$, we have

$$E(X_r X_s) = E_G E_F(X_r X_s) = \sigma_{Grs} + \sigma_{rs} + \mu_{Gr}\mu_{Gs}$$

so that the estimate of $\Sigma_G$ is again given by formula (4.4.3).
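
As a numerical check on (4.4.8), the following Python sketch computes a and B from the first form of each expression and confirms that the second form, and the posterior mean (4.4.1), give the same result. The function name `linear_bayes_coefficients` and the particular matrices are illustrative assumptions only.

```python
import numpy as np

def linear_bayes_coefficients(mu_G, sigma_G, sigma):
    """Return (a, B) of (4.4.8) for known Sigma."""
    mu_G = np.asarray(mu_G, dtype=float)
    B = sigma_G @ np.linalg.inv(sigma_G + sigma)
    a = (np.eye(len(mu_G)) - B) @ mu_G        # from a + B mu_G = mu_G
    return a, B

# Arbitrary illustrative values (not taken from the text).
mu_G = np.array([1.0, -2.0])
sigma_G = np.array([[2.0, 0.5], [0.5, 1.0]])
sigma = np.array([[1.0, 0.2], [0.2, 3.0]])
x = np.array([0.3, 1.7])

a, B = linear_bayes_coefficients(mu_G, sigma_G, sigma)

# Posterior covariance M and posterior mean from (4.4.1).
M = np.linalg.inv(np.linalg.inv(sigma) + np.linalg.inv(sigma_G))
posterior_mean = M @ (np.linalg.solve(sigma, x) + np.linalg.solve(sigma_G, mu_G))
```

With these values `a + B @ x` coincides with `posterior_mean`, which is the agreement between (4.4.8) and (4.4.1) noted above.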


(b) $\Sigma = \sigma^2\Gamma$, $\Gamma$ known

In Example 4.4.1 we took the residual variance $\sigma^2 = \omega$ to be the same
in each component problem. In that case it seems obvious that one
should estimate it by a pooled residual variance, and if n is large
replacing $\sigma^2$ by such an estimate would have little effect on W(linear
EB); a study of the effect of estimating $\omega$ in this context is reported by
Martz and Krutchkoff (1969). One of the advantages of the linear
Bayes and EB approach is that $\omega$ can be allowed not only to be
unknown, but also to vary randomly from component to component.
Thus we have a model where $\Sigma_i = \omega_i\Gamma_i$, $\Gamma_i$ being known at each
component, but $\omega_i$, $i = 1, 2, \ldots, n$ being independent realizations of a
non-degenerate random variable.

Under this model the linear Bayes estimate of the same form as in
(4.4.5) is given by a and B derived from

$$a + B\mu_G = \mu_G$$
$$a\mu_G^T + B(\mu_G\mu_G^T + \Sigma_G + \omega_G\Gamma) = \mu_G\mu_G^T + \Sigma_G \tag{4.4.9}$$

where $\omega_G = \int \omega\, dG(\lambda, \omega) = E_G(\omega)$. Note also that W(a + Bx) is of the
same form as (4.4.6) with $\Sigma$ replaced by $\omega_G\Gamma$.

To derive an empirical version of the linear Bayes estimate we need
estimates of $\mu_G$, $\Sigma_G$, $\omega_G$. Suppose that $\hat\omega_i$ is an unbiased estimate of $\omega_i$
at the ith component. Then we can estimate $\omega_G$ by
$\bar\omega = (\hat\omega_1 + \hat\omega_2 + \cdots + \hat\omega_n)/n$, $\mu_G$ by $\bar x$ as before and $\Sigma_G$ by
$\hat\Sigma_G = S - \bar x\bar x^T - \bar\omega(1/n)\sum_{i=1}^n \Gamma_i$.
The linear EBE is of the same form as the linear Bayes estimate in
(4.4.5) with a, B replaced by $\hat a$, $\hat B$, the solution of (4.4.9) with $\mu_G$, $\Sigma_G$, $\omega_G$
replaced by their estimated values.


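Example 4.4.2 In studies of the type discussed in Example 4.4.1 it is
sometimes more realistic to assume that $\omega_i$ varies from subject to
subject. Typically the sets of $t_{ji}$ values will also differ from subject to
subject. The data on n = 50 subjects summarized in this example were
generated to resemble data actually collected in a longitudinal study
of lung function of factory workers. A typical individual data set is the
following for subject i = 6; the notation is as for Example 4.4.1:

$t_{ji}$      3         6         7         8         10
$y_{ji}$      3437.20   3704.00   4010.74   3918.82   4175.15

giving $a_6 = 3128.42$, $b_6 = 105.994$, $\hat\omega_6 = 8995$,

$$\Gamma_6 = \begin{pmatrix} 1.9254 & -0.2537 \\ -0.2537 & 0.03731 \end{pmatrix}.$$

The summary quantities for the 50 subjects are

$$\bar x = \begin{pmatrix} 3976.7 \\ 30.66 \end{pmatrix}, \qquad (1/50)\sum_{i=1}^{50}\Gamma_i = \begin{pmatrix} 0.7635 & -0.1092 \\ -0.1092 & 0.01986 \end{pmatrix}, \qquad \bar\omega = 56288,$$

$$S - \bar x\bar x^T = \begin{pmatrix} 70278 & -5773.8 \\ -5773.8 & 1291.7 \end{pmatrix}, \qquad \hat\Sigma_G = \begin{pmatrix} 27302 & 372.8 \\ 372.8 & 173.8 \end{pmatrix}.$$

The empirical Bayes estimate for subject i = 6 is obtained from
$\hat a_6 = (I - \hat B_6)\bar x$ and

$$\hat B_6 = \hat\Sigma_G(\hat\Sigma_G + \bar\omega\Gamma_6)^{-1} = \begin{pmatrix} 0.5842 & 3.7372 \\ 0.02836 & 0.2497 \end{pmatrix},$$

giving

$$\delta(x; \hat a, \hat B) = \hat a_6 + \hat B_6\begin{pmatrix} 3128.42 \\ 105.994 \end{pmatrix} = \begin{pmatrix} 3762.5 \\ 25.42 \end{pmatrix}.$$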
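
The calculations for subject 6 can be retraced in Python from the figures printed in the example. The sketch below is illustrative only: it assumes that the 2 x 2 matrix listed with the subject 6 data is $\Gamma_6 = (T^T T)^{-1}$ for the design matrix T of the fitted line, and that the linear EBE is $\bar x + \hat B_6(x_6 - \bar x)$, which follows from the first equation of (4.4.9) with estimates substituted. Variable names are hypothetical.

```python
import numpy as np

# Data for subject i = 6 as printed in Example 4.4.2.
t = np.array([3.0, 6.0, 7.0, 8.0, 10.0])
y = np.array([3437.20, 3704.00, 4010.74, 3918.82, 4175.15])

# Least-squares intercept and slope, and Gamma_6 = (T'T)^{-1}.
T = np.column_stack([np.ones_like(t), t])
gamma_6 = np.linalg.inv(T.T @ T)
a_6, b_6 = gamma_6 @ T.T @ y

# Summary estimates for the 50 subjects, copied from the example.
x_bar = np.array([3976.7, 30.66])
omega_bar = 56288.0
sigma_G_hat = np.array([[27302.0, 372.8], [372.8, 173.8]])

# Linear EB estimate for subject 6.
B_6 = sigma_G_hat @ np.linalg.inv(sigma_G_hat + omega_bar * gamma_6)
x_6 = np.array([a_6, b_6])
delta_6 = x_bar + B_6 @ (x_6 - x_bar)
```

The computed `B_6` and `delta_6` agree with the printed values up to the rounding of the inputs.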





(c) General variable $\Sigma$

In (a) and (b) we have dealt with rather special structures for the
covariance matrix of the data distribution, although it has to be said
that examples like Example 4.4.2 are common enough. More generally
we may consider the multivariate analogue of the case treated in
section 4.2.3(a), where $\Sigma$ has a distribution with expectation
$\int \Sigma\, dG(\lambda, \Sigma) = E_G(\Sigma)$.

With the linear Bayes estimate defined as before the equations for a
and B are

$$a + B\mu_G = \mu_G$$
$$a\mu_G^T + B(\mu_G\mu_G^T + \Sigma_G + E_G\Sigma) = \mu_G\mu_G^T + \Sigma_G \tag{4.4.10}$$

and note that W(a + Bx) is given by (4.4.6) with $\Sigma$ replaced by $E_G\Sigma$.

For an empirical linear Bayes estimate we need an estimate of $E_G\Sigma$.
Suppose that $\hat\Sigma_i$ is an unbiased estimate of $\Sigma_i$ at the ith component.
Then we estimate $E_G\Sigma$ by $(1/n)\sum_{i=1}^n \hat\Sigma_i$, and estimation of the other
unknowns in (4.4.10) proceeds as in cases (a) and (b).

4.4.3 Simple EB estimation 

Here we shall take the covariance matrix of the data distribution, $\Sigma$, as
known, or known apart from a multiplicative constant and fixed from
component to component. We are, as in previous sections, dealing
with the multivariate normal distribution, concerned only with
estimating the mean vector $\lambda$.

Let $C = \Sigma^{-1}$, then the joint density of x for given $\lambda$ is

$$f(x \mid \lambda) = (\text{const.})\exp\{-\tfrac{1}{2}(x - \lambda)^T C(x - \lambda)\} \tag{4.4.11}$$

and

$$\frac{\partial f(x \mid \lambda)}{\partial x_r} = -f(x \mid \lambda)\sum_{j=1}^k (x_j - \lambda_j)C_{rj}, \tag{4.4.12}$$

$r = 1, 2, \ldots, k$.

Integrating both sides of (4.4.12) w.r.t. $G(\lambda)$ and dividing by $f_G(x)$
we obtain

$$\frac{\partial f_G(x)/\partial x_r}{f_G(x)} + \sum_{j=1}^k x_j C_{rj} = \sum_{j=1}^k E(\Lambda_j \mid x)C_{rj}, \tag{4.4.13}$$

$r = 1, 2, \ldots, k$. Let $p_r = \{\partial f_G(x)/\partial x_r\}/f_G(x)$, $r = 1, 2, \ldots, k$;
$\delta_{Gj} = E(\Lambda_j \mid x)$, $j = 1, 2, \ldots, k$. Then we can write (4.4.13) as

$$p + Cx = C\delta_G. \tag{4.4.14}$$

From (4.4.14) a simple EBE of $\lambda$ can be calculated if p is replaced by
an estimate $\hat p$ of p. Such an estimate requires estimation of the
multivariate density $f_G(x)$ and its derivatives. Methods for estimating
these quantities exist and can be applied to the observed vectors $x_i$,
$i = 1, 2, \ldots, n$ which can be regarded as independent observations from
a population with density $f_G(x)$. For multivariate density estimation,
see, for example, Cacoullos (1966). A derivation and a study of simple
EB estimation in multiple regression is given by Martz and Krutchkoff
(1969). Smoothing of EBEs of this sort is discussed by Bennett
and Martz (1972) and Lemon and Krutchkoff (1969).
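
One way to make (4.4.14) concrete is to estimate p from a Gaussian kernel density estimate of $f_G$ and then solve for $\delta_G = x + \Sigma p$. The Python sketch below is only an illustration of that idea, not the method of any of the cited papers: the function name `simple_eb_estimate`, the spherical Gaussian kernel and the bandwidth `h` are all assumptions introduced here.

```python
import numpy as np

def simple_eb_estimate(x, past, sigma, h):
    """Simple EBE x + Sigma p_hat, with p_hat from a Gaussian kernel estimate.

    x     : (k,) current observation
    past  : (n, k) past observation vectors x_i
    sigma : (k, k) known covariance matrix Sigma
    h     : kernel bandwidth (a hypothetical tuning constant)
    """
    x = np.asarray(x, dtype=float)
    past = np.asarray(past, dtype=float)
    diff = past - x                                    # rows are x_i - x
    log_k = -0.5 * np.sum(diff ** 2, axis=1) / h ** 2  # log kernel weights
    w = np.exp(log_k - log_k.max())                    # stabilised weights
    w /= w.sum()
    # Gradient of the log of the kernel density estimate at x.
    p_hat = (w @ diff) / h ** 2
    return x + sigma @ p_hat
```

The kernel normalizing constant cancels in the ratio defining p, so it is never computed; the choice of `h` governs the amount of smoothing.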

4.4.4 Performance of EBEs 

In the somewhat general, but clearly realistic setting of non-identical
components, it is difficult to calculate quantities like $E_n W$(EB).
Numerical results for isolated cases seem to be less useful here because
of the additional arbitrary element in the loss calculations represented
by the choice of matrix A in $L(\delta, \lambda)$ as defined in sections 4.4.1 and
1.6. Even if A is a diagonal matrix it means that the total loss is a sum
of weighted squared errors, and in a particular problem the overall
relative values of W(Bayes), W(ML), $E_n W$(EB) will depend on those
weights. At the same time the expected squared errors of estimates of
individual parameters are still minimized, these corresponding to
obvious special choices of A.

To illustrate some calculations that can be made and some 
problems, we reconsider Example 4.4.2. 

Example 4.4.3 Take the data of Example 4.4.2 and consider a
current component with $t_j$ configuration giving the covariance matrix
of (a, b) as $\sigma^2\Gamma$. Then the linear EBE of the current $\alpha$, $\beta$ is given by
(4.4.5) with B replaced by

$$\hat B = \hat\Sigma_G(\hat\Sigma_G + \bar\omega\Gamma)^{-1}$$

and a replaced by

$$\hat a = (I - \hat B)\bar x.$$

For this estimate

$$W(\text{EB}) = (\hat a + \hat B\mu_G - \mu_G)^T(\hat a + \hat B\mu_G - \mu_G) + \mathrm{tr}(\hat B\Sigma_G\hat B^T) + \omega_G\,\mathrm{tr}(\hat B\Gamma\hat B^T);$$

see (4.4.6). The data of Example 4.4.2 were generated according to a
model in which the joint prior distribution of $\alpha$, $\beta$ is

$$N\left\{\begin{pmatrix} 4000 \\ 30 \end{pmatrix}, \begin{pmatrix} 40000 & 0 \\ 0 & 9 \end{pmatrix}\right\},$$

$\sqrt{\omega}$ is independently distributed N(250, 625) and a random number
$m_i$ of t-values is selected from the integers 1, 2, ..., 10. The distribution
of M is uniform (4, 10). Thus if we let the t-configuration be
(3, 6, 7, 8, 10), as in the case i = 6, we obtain

$$\hat B = \begin{pmatrix} 0.5842 & 3.7372 \\ 0.02836 & 0.2497 \end{pmatrix}, \qquad \hat a = \begin{pmatrix} 1538.9 \\ -89.78 \end{pmatrix},$$

$$W(\text{EB}) = (\hat a + \hat B\mu_G - \mu_G)^T(\hat a + \hat B\mu_G - \mu_G) + \mathrm{tr}(\hat B\Sigma_G\hat B^T) + \omega_G\,\mathrm{tr}(\hat B\Gamma\hat B^T) = 18369 + 52.$$

The first term is W(EB) for estimating the intercept, the second is
W(EB) for the slope estimate.

In the non-identical component setting of this example it seems
virtually impossible to obtain an analogous result for $E_n W$(EB). In
order to obtain an estimate of $E_n W$(EB) one would have to generate
n = 50 sets of observations and calculate $\hat a$, $\hat B$ for each of them, then
W(EB), and finally average the results.

An alternative (quicker?), and perhaps more realistic calculation is
to obtain the mean of the actual losses in the components of the
realized observations. This mean is an estimate of the average
expected squared error in n = 50 components. It can be regarded as an
overall measure of performance of the estimation procedure in the EB
sampling scheme. It is not an estimate of $E_n W$(EB) for a particular
current component such as that exhibited above. For the one n = 50
component realization referred to in Example 4.4.2 the following
mean squared errors were obtained.

Intercept estimate   ML    : 39888
                     Bayes : 9014
                     EB    : 10858

Slope estimate       ML    : 1268
                     Bayes : 12
                     EB    : 67

Finally, we consider assessing the relative performance of ML,
linear Bayes and linear EB estimates from the realized data set. The
linear Bayes estimate has

$$W(\text{linear Bayes}) = \mathrm{tr}(B\Sigma_G B^T) + \omega_G\,\mathrm{tr}(B\Gamma B^T)$$

and the difference $E_n W$(linear EB) $-$ W(linear Bayes) is

$$E_n\{(\hat a - a) + (\hat B - B)\mu_G\}^T\{(\hat a - a) + (\hat B - B)\mu_G\} + E_n\,\mathrm{tr}(\hat B - B)(\Sigma_G + \omega_G\Gamma)(\hat B - B)^T. \tag{4.4.15}$$

Assuming the estimates of a and B to be approximately unbiased the
latter expression can be rewritten in terms of the variances and
covariances of the estimates of the elements of a and B. The first term
in (4.4.15) becomes

$$\sum_{j=1}^{2}\{\mathrm{var}(\hat a_j) + 2\mu_{G1}\mathrm{cov}(\hat a_j, \hat b_{j1}) + 2\mu_{G2}\mathrm{cov}(\hat a_j, \hat b_{j2}) + \mu_{G1}^2\mathrm{var}(\hat b_{j1}) + 2\mu_{G1}\mu_{G2}\mathrm{cov}(\hat b_{j1}, \hat b_{j2}) + \mu_{G2}^2\mathrm{var}(\hat b_{j2})\}.$$

The second term becomes

$$w_{11}\{\mathrm{var}(\hat b_{11}) + \mathrm{var}(\hat b_{21})\} + w_{22}\{\mathrm{var}(\hat b_{12}) + \mathrm{var}(\hat b_{22})\} + 2w_{12}\{\mathrm{cov}(\hat b_{11}, \hat b_{12}) + \mathrm{cov}(\hat b_{21}, \hat b_{22})\}$$

where $w_{ij}$ are the elements of $W = \Sigma_G + \omega_G\Gamma$.

The value of W(linear Bayes) can be estimated by substituting
estimates for the unknown quantities. To estimate $E_n W$(linear EB) $-$
W(linear Bayes) we need estimates of the variances and covariances
of $\hat a_i$, $\hat b_{ij}$, $i, j = 1, 2$. Since the estimates are obtained as the
solutions of six estimating equations summarized in

$$\hat a + \hat B\bar x = \bar x$$
$$\hat a\bar x^T + \hat B\{S - \bar\omega(1/n)\sum_i \Gamma_i + \bar\omega\Gamma\} = S - \bar\omega(1/n)\sum_i \Gamma_i$$

the steps given in section 3.11 can be followed to obtain estimates of
the required variances and covariances.

4.5 The multinomial distribution 

We write the multinomial distribution as

$$P(X = x \mid \theta) = P(X_1 = x_1, X_2 = x_2, \ldots, X_p = x_p \mid \theta_1, \theta_2, \ldots, \theta_p) = \frac{m!}{\prod_{j=1}^p x_j!}\,\theta_1^{x_1}\theta_2^{x_2}\cdots\theta_p^{x_p} \tag{4.5.1}$$

where $x_1 + x_2 + \cdots + x_p = m$, $\theta_1 + \theta_2 + \cdots + \theta_p = 1$ and $0 \le x_j \le m$
for every j. It is fundamental in the analyses of categorical data, typical
cases which we shall consider being the following:

1. Each of a number of subjects is given m questions or propositions
to each of which p = 3 mutually exclusive responses are possible. For
example, the responses might be strong agreement, strong disagreement,
and neutral. Then each subject can be regarded as
having probabilities $\theta_1, \theta_2, \theta_3$ of registering a response in the three
categories respectively, and the observed numbers of strong
agreement, etc., responses are multinomial observations. If
$\theta = (\theta_1, \theta_2, \theta_3)$ varies randomly from subject to subject we have, for n
subjects, a sequence $\theta_i$, $i = 1, 2, \ldots, n$ of parameters, and observations
$x_i$, $i = 1, 2, \ldots, n$ in a typical EB sampling scheme.

2. Cross-tabulated data in contingency tables. The simplest of these is 
a 2 x 2 contingency table. Such tables arise in a great diversity of 
applications. As an example, consider that a randomly selected 
group of subjects is randomly partitioned into two subgroups, one 
of which is treated with a certain drug, the other being a control 
group. The observed response is recovery or non-recovery from a 
certain condition. The results of such a trial can be summarized in a 
2x2 contingency table. In a drug screening study one may have a 
sequence of drugs which could be regarded as having been sampled 
from a population of drugs, the trial of each drug giving rise to a 
2x2 contingency table, thus again providing an EB setting for 
the multinomial distribution. 

3. Univariate data summarized in a grouped frequency distribution,
where the observed frequencies in the groups can be regarded as a
multinomial observation: an approach to Bayes and EB estimation
of distribution functions or quantiles is possible with the
above setting as starting point.

4. Two-way contingency tables with n rows and p columns. If the
rows can be regarded as corresponding to realizations of a random
effect the results of each row can be taken as a multinomial
observation, with randomly varying $\theta$.


4.5.1 Estimation with Dirichlet priors 

The Dirichlet prior density for $\theta_1, \theta_2, \ldots, \theta_p$ is

$$g(\theta) = \frac{\Gamma(\alpha_1 + \alpha_2 + \cdots + \alpha_p)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\cdots\Gamma(\alpha_p)}\,\theta_1^{\alpha_1 - 1}\theta_2^{\alpha_2 - 1}\cdots\theta_p^{\alpha_p - 1}, \tag{4.5.2}$$

the conjugate prior for the multinomial distribution. Straightforward
calculation gives the posterior distribution of $\theta$ given x as

$$g(\theta \mid x) = \Gamma(A + m)\prod_{i=1}^p \frac{\theta_i^{\alpha_i + x_i - 1}}{\Gamma(\alpha_i + x_i)}, \tag{4.5.3}$$

where $A = \alpha_1 + \alpha_2 + \cdots + \alpha_p$. It is also a Dirichlet distribution. Also

$$E(\theta_r \mid x) = \frac{\alpha_r + x_r}{A + m}, \quad r = 1, 2, \ldots, p \tag{4.5.4}$$

and the expected squared error of this estimate of $\theta_r$ is
$\alpha_r(A - \alpha_r)/\{A(A + 1)(A + m)\}$. If we take W(Bayes) to be the sum of the
squared errors we have

$$W(\text{Bayes}) = \frac{\sum_{r=1}^p \alpha_r(A - \alpha_r)}{A(A + 1)(A + m)}. \tag{4.5.5}$$

In order to estimate the parameters $\alpha_1, \alpha_2, \ldots, \alpha_p$ of the prior
distribution in the empirical Bayes setting, recall that we consider n
past realizations $\theta_1, \theta_2, \ldots, \theta_n$ of $\theta$, unobserved, and the corresponding
observations $x_1, x_2, \ldots, x_n$. Estimation by the method of maximum
likelihood is straightforward in principle, but we shall here deal
only with estimation by the method of moments. In the marginal
X-distribution we have

$$E(X_j) = m\alpha_j/A$$
$$E\{X_j(X_j - 1)\} = \frac{m(m - 1)\alpha_j(\alpha_j + 1)}{A(A + 1)},$$

$j = 1, 2, \ldots, p$. Using the first moment equations we can let the MM
estimates of $\alpha_1, \alpha_2, \ldots, \alpha_p$ satisfy

$$\bar x_j = m\hat\alpha_j/\hat A = m\hat\mu_j, \quad j = 1, 2, \ldots, p. \tag{4.5.6}$$

Also, using the second moment equations we let

$$Q = \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^p \frac{x_{ji}(x_{ji} - 1)}{m(m - 1)} = \frac{\sum_{j=1}^p \hat\alpha_j^2 + \hat A}{\hat A(\hat A + 1)}. \tag{4.5.7}$$

Writing $B = \hat\mu_1^2 + \hat\mu_2^2 + \cdots + \hat\mu_p^2$, the estimates are

$$\hat A = (1 - Q)/(Q - B), \quad \text{for } Q > B,$$
$$\hat\alpha_j = \hat A\hat\mu_j, \quad j = 1, 2, \ldots, p. \tag{4.5.8}$$

When $Q \le B$ the prior distribution is estimated as being degenerate at
$(\hat\mu_1, \hat\mu_2, \ldots, \hat\mu_p)$. This means that if each component is in turn treated as
the current realization, every one will have the same estimated $\theta$; this
is just the limiting case of 'shrinkage' estimation.
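
The moment estimates (4.5.6) to (4.5.8) are simple to compute. The following Python sketch is a minimal illustration, assuming a common number of trials m in every component; the function name `dirichlet_mom` and the convention of returning an infinite $\hat A$ in the degenerate case are choices made here, not part of the text.

```python
import numpy as np

def dirichlet_mom(x, m):
    """Method-of-moments estimates for a Dirichlet prior, following (4.5.6)-(4.5.8).

    x : (n, p) array of multinomial counts, each row summing to m
    Returns (mu_hat, A_hat, alpha_hat); A_hat is inf and alpha_hat is None
    when Q <= B, i.e. when the prior is estimated as degenerate at mu_hat.
    """
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean(axis=0) / m                            # (4.5.6)
    Q = (x * (x - 1)).sum(axis=1).mean() / (m * (m - 1))   # (4.5.7)
    B = np.sum(mu_hat ** 2)
    if Q <= B:
        return mu_hat, np.inf, None
    A_hat = (1 - Q) / (Q - B)                              # (4.5.8)
    return mu_hat, A_hat, A_hat * mu_hat
```

For example, three components with counts (10, 0), (0, 10) and (5, 5) out of m = 10 trials give $\hat\mu = (0.5, 0.5)$, Q = 22/27, B = 1/2 and hence $\hat A = 10/17$.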


4.5.2 Linear Bayes and EB estimation 


In previous discussions of multiple-parameter Bayes estimation we
have seen that the loss can be taken as a sum of squared errors in
deriving the Bayes estimate. This means, effectively, that we need
consider only the estimation of individual parameters. To illustrate
derivation of linear Bayes estimates for the multinomial case we
consider the trinomial distribution; the more general case can be
treated in a similar way.

Let $\theta_1$ be estimated by $a_1 + b_{11}(x_1/m) + b_{12}(x_2/m)$. Then the
expected squared error of estimation is

$$\int \sum_{x_1, x_2}\left(a_1 + b_{11}\frac{x_1}{m} + b_{12}\frac{x_2}{m} - \theta_1\right)^2 P(x \mid \theta)\, dG(\theta_1, \theta_2)$$

and we minimize it w.r.t. $a_1$, $b_{11}$, $b_{12}$. The first and second moments of
$G(\theta_1, \theta_2)$ are $E(\theta_1) = \mu_{1G}$, $E(\theta_2) = \mu_{2G}$, $\mathrm{var}(\theta_1) = \sigma_{1G}^2$, $\mathrm{var}(\theta_2) = \sigma_{2G}^2$,
$\mathrm{cov}(\theta_1, \theta_2) = \sigma_{12G}$, and differentiating w.r.t. $a_1$, $b_{11}$, $b_{12}$ the following
equations are obtained for the optimal $a_1$, $b_{11}$, $b_{12}$, after some
manipulation.

$$a_1 + b_{11}\mu_{1G} + b_{12}\mu_{2G} = \mu_{1G}$$
$$b_{11}\left\{\sigma_{1G}^2 - \frac{1}{m}(\sigma_{1G}^2 + \mu_{1G}^2) + \frac{1}{m}\mu_{1G}\right\} + b_{12}\left\{\sigma_{12G} - \frac{1}{m}(\sigma_{12G} + \mu_{1G}\mu_{2G})\right\} = \sigma_{1G}^2$$
$$b_{11}\left\{\sigma_{12G} - \frac{1}{m}(\sigma_{12G} + \mu_{1G}\mu_{2G})\right\} + b_{12}\left\{\sigma_{2G}^2 - \frac{1}{m}(\sigma_{2G}^2 + \mu_{2G}^2) + \frac{1}{m}\mu_{2G}\right\} = \sigma_{12G} \tag{4.5.9}$$

If the prior distribution is Dirichlet as given in (4.5.2), $\mu_{1G} = \alpha_1/A$,
$\mu_{2G} = \alpha_2/A$, $\sigma_{1G}^2 = \alpha_1(A - \alpha_1)/\{A^2(A + 1)\}$, $\sigma_{2G}^2 = \alpha_2(A - \alpha_2)/\{A^2(A + 1)\}$,
$\sigma_{12G} = -\alpha_1\alpha_2/\{A^2(A + 1)\}$. Substituting these values in (4.5.9)
gives $b_{12} = 0$, $b_{11} = m/(A + m)$, $a_1 = \alpha_1/(A + m)$, in agreement with
(4.5.4).

In the marginal X-distribution we have

$$E(X_r) = m\mu_{rG}, \quad r = 1, 2,$$
$$\mathrm{var}(X_r) = m(m - 1)\sigma_{rG}^2 + m\mu_{rG}(1 - \mu_{rG}), \quad r = 1, 2, \tag{4.5.10}$$
$$\mathrm{cov}(X_1, X_2) = m(m - 1)\sigma_{12G} - m\mu_{1G}\mu_{2G}.$$

Replacing the left sides of (4.5.10) by the empirical moments
calculated from n past realizations we obtain equations from which
estimates of $\mu_{rG}$, $\sigma_{rG}^2$, $r = 1, 2$ and $\sigma_{12G}$ can be calculated.
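
The claim that a Dirichlet prior reduces the linear Bayes solution to the posterior mean (4.5.4) can be checked numerically. The Python sketch below solves the three equations (4.5.9) for given prior moments; the function name `trinomial_linear_bayes` and the numerical values used in the check are illustrative assumptions.

```python
import numpy as np

def trinomial_linear_bayes(mu, var, cov12, m):
    """Solve equations (4.5.9) for (a_1, b_11, b_12).

    mu    : (mu_1G, mu_2G), prior means of theta_1, theta_2
    var   : (sigma^2_1G, sigma^2_2G), prior variances
    cov12 : sigma_12G, prior covariance
    m     : number of multinomial trials
    """
    mu1, mu2 = mu
    v1, v2 = var
    c11 = v1 - (v1 + mu1 ** 2) / m + mu1 / m
    c22 = v2 - (v2 + mu2 ** 2) / m + mu2 / m
    c12 = cov12 - (cov12 + mu1 * mu2) / m
    b11, b12 = np.linalg.solve([[c11, c12], [c12, c22]], [v1, cov12])
    a1 = mu1 - b11 * mu1 - b12 * mu2          # first equation of (4.5.9)
    return a1, b11, b12

# Dirichlet prior with alpha = (2, 3, 5), so A = 10, and m = 12 trials.
alpha, m = np.array([2.0, 3.0, 5.0]), 12
A = alpha.sum()
mu = alpha[:2] / A
var = alpha[:2] * (A - alpha[:2]) / (A ** 2 * (A + 1))
cov12 = -alpha[0] * alpha[1] / (A ** 2 * (A + 1))
a1, b11, b12 = trinomial_linear_bayes(mu, var, cov12, m)
```

Here `b12` vanishes (to rounding error), `b11` equals m/(A + m) = 12/22 and `a1` equals $\alpha_1/(A + m)$ = 2/22, as stated.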

4.5.3 Non-identical components 

In many realistic applications of the ideas of sections 4.5.1 and 4.5.2
one may expect m to vary from one component to another; thus, at
past component i we have $m_i$ instead of m. The formulae for Bayes
estimation at the current component do not change, but allowance
has to be made for the variable m in estimating relevant parameters
for EB estimation.

Dirichlet prior: for the ith component we can write the marginal
expectations as

$$E(X_{ji}) = m_i\alpha_j/A$$
$$E\{X_{ji}(X_{ji} - 1)\} = m_i(m_i - 1)\alpha_j(\alpha_j + 1)/\{A(A + 1)\}.$$

The estimating equations (4.5.6) and (4.5.7) become

$$\bar x_j = \bar m\hat\alpha_j/\hat A$$
$$\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^p x_{ji}(x_{ji} - 1) = \left\{\frac{1}{n}\sum_{i=1}^n m_i(m_i - 1)\right\}\left(\sum_{j=1}^p \hat\alpha_j^2 + \hat A\right)\Big/\{\hat A(\hat A + 1)\}, \tag{4.5.11}$$

where $\bar m$ is the mean of the $m_i$ values.

An alternative form of (4.5.11) is

$$\frac{1}{n}\sum_{i=1}^n \frac{x_{ji}}{m_i} = \hat\mu_j, \quad j = 1, 2, \ldots, p,$$
$$Q = \frac{1}{n}\sum_{j=1}^p\sum_{i=1}^n x_{ji}(x_{ji} - 1)/\{m_i(m_i - 1)\} = \left(\sum_{j=1}^p \hat\alpha_j^2 + \hat A\right)\Big/\{\hat A(\hat A + 1)\}. \tag{4.5.12}$$

Putting $B = \hat\mu_1^2 + \hat\mu_2^2 + \cdots + \hat\mu_p^2$ as before, the estimates are given
again by equations (4.5.8).

An example of the application of these methods is given in
section 8.4. It concerns a contingency table studied by Laird (1978)
using a different model.

Linear EB: instead of the relations (4.5.10) we use

$$E(X_{ri}) = m_i\mu_{rG}, \quad r = 1, 2$$
$$E\{X_{ri}(X_{ri} - 1)\} = m_i(m_i - 1)(\sigma_{rG}^2 + \mu_{rG}^2), \quad r = 1, 2$$
$$E(X_{1i}X_{2i}) = m_i(m_i - 1)(\sigma_{12G} + \mu_{1G}\mu_{2G}).$$

Replacing the l.h.s. expectations by appropriate mean values we
obtain

$$(1/n)\sum_{i=1}^n x_{ri} = \hat\mu_{rG}\,\bar m, \quad r = 1, 2$$
$$(1/n)\sum_{i=1}^n x_{ri}(x_{ri} - 1) = (\hat\sigma_{rG}^2 + \hat\mu_{rG}^2)(1/n)\sum_{i=1}^n m_i(m_i - 1) \tag{4.5.13}$$
$$(1/n)\sum_{i=1}^n x_{1i}x_{2i} = (\hat\sigma_{12G} + \hat\mu_{1G}\hat\mu_{2G})(1/n)\sum_{i=1}^n m_i(m_i - 1)$$

from which to calculate estimates of the parameters in (4.5.9).


4.5.4 Simple EB estimation 

For this discussion it will be convenient to use the notation $x^{(m)}$ for the
multinomial vector resulting from m trials. Also, let $J_j$ be an operator
such that

$$J_j\phi(x_1, x_2, \ldots, x_p) = \phi(x_1, x_2, \ldots, x_j + 1, \ldots, x_p).$$

Then the Bayes estimate of $\theta_j$ can be expressed as

$$E(\theta_j \mid x^{(m)}) = \frac{x_j + 1}{m + 1}\,\frac{J_j\, p_{G,m+1}(x^{(m)})}{p_{G,m}(x^{(m)})} \tag{4.5.14}$$

where

$$p_{G,m}(x^{(m)}) = \frac{m!}{x_1!x_2!\cdots x_p!}\int \theta_1^{x_1}\theta_2^{x_2}\cdots\theta_p^{x_p}\, dG(\theta),$$

the marginal x-distribution. Since we actually observe the outcomes
of m trials only we have to use a version of (4.5.14) with m replaced by
(m − 1), when application to EB estimation is considered. Under the
EB sampling scheme, for fixed m, a direct estimate of $p_{G,m}(x^{(m)})$ can be
obtained. By omitting one of m observations in every component a
direct estimate of $p_{G,m-1}(x^{(m-1)})$ is obtainable, and schemes such as
averaging estimates generated by successively omitting every observation
in turn can be considered.

An alternative approach, analogous to that of section 3.4, is to let,
for $i \ne p$,

$$K_i\phi(x_1, x_2, \ldots, x_p) = \phi(x_1, x_2, \ldots, x_i + 1, \ldots, x_p - 1).$$

Also let B be the posterior distribution function of $\theta$, and $E_B$
expectation w.r.t. B, i.e. posterior expectation.

Then

$$K_i\, p_{G,m}(x^{(m)}) = \frac{x_p^{(m)}}{x_i^{(m)} + 1}\, p_{G,m}(x^{(m)})\, E_B(\theta_i/\theta_p). \tag{4.5.15}$$

Formula (4.5.15) shows that a simple EBE of the ratio $\theta_i/\theta_p$ can be
derived from the observations $x_j^{(m)}$, $j = 1, 2, \ldots, n$ on the marginal
X-distribution.

An approximation for $E_B(\theta_i)$ can be derived, following Maritz and
Lwin (1975). We need relations

$$K_iK_j\, p_{G,m}(x^{(m)}) = \frac{x_p^{(m)}(x_p^{(m)} - 1)}{(x_i^{(m)} + 1)(x_j^{(m)} + 1)}\, p_{G,m}(x^{(m)})\, E_B(\theta_i\theta_j/\theta_p^2) \tag{4.5.16}$$

for $i, j = 1, 2, \ldots, p - 1$ (when i = j the denominator becomes
$(x_i^{(m)} + 1)(x_i^{(m)} + 2)$).

Write $\Lambda_i = \theta_i/\theta_p$, $i = 1, 2, \ldots, p$, so that $\theta_i = \Lambda_i/(\Lambda_1 + \Lambda_2 + \cdots + \Lambda_p)$.
From (4.5.15) and (4.5.16) we can get expressions for the posterior
moments of order 1 and 2 of the $\Lambda_i$, and can derive approximations for
the posterior expectations of the $\theta_i$ by the usual Taylor expansion
technique.


4.6 Linear Bayes and EB, and subsets of parameters 

Linear Bayes and EB estimation has been considered in the special
cases dealt with in earlier sections of this chapter. Now we look at it in
a slightly more general setting.

Suppose that $m_i$ observations on the vector r.v. X at the ith
component yield unbiased estimates $t_i$ of parameters $\theta_i$, $i = 1, 2, \ldots, p$
and also estimates $\hat\omega$ of parameters $\omega$. It is convenient to think of
$\theta$ as the parameters of primary interest, and $\omega$ as nuisance parameters.
For example, in the case of the multivariate normal distribution,
$\theta$ might be the vector of means, $\omega$ the collection of dispersion
parameters.

A linear Bayes estimate of $\theta$, derived from t, is obtained by
minimizing the expectation

$$\int\!\!\int (a + Bx - \theta)^T(a + Bx - \theta)\, dF(t \mid \theta)\, dG(\theta, \omega).$$

Let $\Sigma(\theta, \omega)$ be the covariance matrix of x for given $\theta$, $\omega$, $\Sigma_G$ the prior
covariance matrix of $\theta$, and $\mu_G$ the prior expectation of $\theta$. Then the
optimal a and B are given by

$$a + B\mu_G = \mu_G$$
$$a\mu_G^T + B\{\mu_G\mu_G^T + \Sigma_G + E_G\Sigma(\theta, \omega)\} = \mu_G\mu_G^T + \Sigma_G. \tag{4.6.1}$$

These equations are essentially like (4.4.10). Eliminating a from these
equations, B is given by

$$B\{\Sigma_G + E_G\Sigma(\theta, \omega)\} = \Sigma_G. \tag{4.6.2}$$

The l.h.s. matrix in braces is the marginal X covariance matrix.
Therefore, in the EB sampling scheme, where the conditional
covariance matrix at the ith component is $\Sigma_i(\theta_i, \omega_i)$ we can estimate
$\Sigma_G$ by

$$\hat\Sigma_G = S - \frac{1}{n}\sum_i \hat\Sigma_i$$

where

$$S = \frac{1}{n}\sum_i x_i x_i^T - \bar x\bar x^T$$

and $\hat\Sigma_i$ is an estimate of the conditional covariance matrix at the ith
component. The mean vector $\mu_G$ is estimated by $\bar x$, assuming that the
parametrization is appropriate.

In general the attraction of linear EB estimation is its relative
simplicity, but there is the drawback that the number of elements of $\mu_G$
and $\Sigma_G$ to be estimated increases rapidly with p. Referring to
Example 4.5.1, with n = 16, p = 11 the number of parameters in $\mu_G$
and $\Sigma_G$ is 66, a seemingly excessively large number given the size of the
data set. A question which arises is whether much is lost if the
dimensionality of the problem is reduced. The following examples
illustrate the effect of reducing dimensionality.

Example 4.6.1 Consider the multinomial Dirichlet prior case. The
Bayes estimate of $\theta_1$ is given by (4.5.4) as

$$E(\theta_1 \mid x) = (\alpha_1 + x_1)/(A + m).$$

If we pool the classes $i = r, r + 1, \ldots, p$, $r \ge 2$, simple calculations show
that the Bayes estimate of $\theta_1$ remains unchanged.

Example 4.6.2 $(X_1, X_2)^T$ distributed $N\{(\theta_1, \theta_2)^T, \Sigma\}$, prior
distribution of $(\theta_1, \theta_2)^T$ is $N(\mu_G, \Sigma_G)$. Then the Bayes
estimate of $\theta_1$ is

$$E(\theta_1 \mid x) = -0.0782 + 0.1558x_1 + 0.0782x_2$$

and its expected squared error is 0.0519.

Ignoring information about $\theta_1$ supplied by observation on $X_2$, i.e.
simply using the fact that $X_1 = N(\theta_1, 1/3)$, $\theta_1 = N(0, 1/15)$, we have
$E(\theta_1 \mid x_1) = x_1/6$ with expected squared error 0.0555. The expected
squared error of $X_1$, the MLE of $\theta_1$, is 1/3. Thus the loss in precision by
reducing the dimensionality of the problem is negligible in this
example.
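
The one-dimensional figures quoted in Example 4.6.2 follow from the usual normal-normal shrinkage formulae, as the short Python check below illustrates. The helper name `normal_shrinkage` is hypothetical; exact fractions are used so that the results can be compared with the text without rounding.

```python
from fractions import Fraction

def normal_shrinkage(prior_var, data_var):
    """Weight on x and expected squared error of the Bayes estimate
    for X | theta ~ N(theta, data_var), theta ~ N(0, prior_var)."""
    weight = prior_var / (prior_var + data_var)
    risk = 1 / (1 / prior_var + 1 / data_var)
    return weight, risk

# X_1 ~ N(theta_1, 1/3), theta_1 ~ N(0, 1/15), as in the example.
weight, risk = normal_shrinkage(Fraction(1, 15), Fraction(1, 3))
```

This gives a weight of 1/6 on $x_1$ and an expected squared error of 1/18, which is the value quoted as 0.0555; the MLE has expected squared error equal to the data variance, 1/3.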

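Example 4.6.3 A slightly more general version of the previous
example is to take

$$\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = N\left\{\begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}\right\}$$

and

$$\begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} = N\left\{\begin{pmatrix} \mu_{1G} \\ \mu_{2G} \end{pmatrix}, \begin{pmatrix} \sigma_{G1}^2 & \rho\sigma_{G1}\sigma_{G2} \\ \rho\sigma_{G1}\sigma_{G2} & \sigma_{G2}^2 \end{pmatrix}\right\}.$$

Then the expected squared error of the Bayes estimate of $\theta_1$ is

$$\frac{1/\sigma_{G2}^2 + (1 - \rho^2)/\sigma_2^2}{\dfrac{1}{\sigma_{G1}^2\sigma_{G2}^2} + \dfrac{1}{\sigma_{G2}^2\sigma_1^2} + \dfrac{1}{\sigma_{G1}^2\sigma_2^2} + \dfrac{(1 - \rho^2)}{\sigma_1^2\sigma_2^2}}.$$

At $\rho = 0$ we have the usual one-parameter result $(1/\sigma_{G1}^2 + 1/\sigma_1^2)^{-1}$,
and this is the maximum w.r.t. $\rho$. The minimum is at $|\rho| = 1$ and
is

$$\frac{1}{\{1/\sigma_{G1}^2 + 1/\sigma_1^2 + \sigma_{G2}^2/(\sigma_{G1}^2\sigma_2^2)\}}.$$

This formula shows that a Bayes estimate of $\theta_1$ involving $x_2$ as well as
$x_1$ will only be appreciably better than the Bayes estimate involving
$x_1$ alone if $\sigma_{G2}^2/(\sigma_{G1}^2\sigma_2^2)$ is large. In Example 4.6.2 $\sigma_{G2}^2/(\sigma_{G1}^2\sigma_2^2) = 3$,
compared with $1/\sigma_{G1}^2 + 1/\sigma_1^2 = 18$ so that the largest and smallest
possible expected squared error values of the Bayes estimates are 0.05
and 0.0476.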
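
The dependence of the expected squared error on $\rho$ can be explored numerically. The Python sketch below is illustrative only: it assumes independent data errors with variances $\sigma_1^2$, $\sigma_2^2$ and a bivariate normal prior with correlation $\rho$, and compares the closed-form expression with a direct matrix calculation. The values $\sigma_{G2}^2 = 0.2$ and $\sigma_2^2 = 1$ are hypothetical, chosen only so that $\sigma_{G2}^2/(\sigma_{G1}^2\sigma_2^2) = 3$ as in Example 4.6.2.

```python
import numpy as np

def bayes_risk_theta1(s1, s2, g1, g2, rho):
    """Posterior variance of theta_1 by direct matrix calculation.

    s1, s2 : data variances; g1, g2 : prior variances; rho : prior correlation.
    Uses Sigma_G - Sigma_G (Sigma_G + Sigma)^{-1} Sigma_G, which remains
    valid when |rho| = 1 and Sigma_G is singular.
    """
    sigma = np.diag([s1, s2])
    c = rho * np.sqrt(g1 * g2)
    sigma_G = np.array([[g1, c], [c, g2]])
    post = sigma_G - sigma_G @ np.linalg.solve(sigma_G + sigma, sigma_G)
    return post[0, 0]

def closed_form(s1, s2, g1, g2, rho):
    """The closed-form expected squared error given in the example."""
    num = 1 / g2 + (1 - rho ** 2) / s2
    den = (1 / (g1 * g2) + 1 / (g2 * s1) + 1 / (g1 * s2)
           + (1 - rho ** 2) / (s1 * s2))
    return num / den

s1, g1 = 1 / 3, 1 / 15          # values from Example 4.6.2
s2, g2 = 1.0, 0.2               # hypothetical values giving the ratio 3
risk_max = bayes_risk_theta1(s1, s2, g1, g2, 0.0)   # rho = 0
risk_min = bayes_risk_theta1(s1, s2, g1, g2, 1.0)   # |rho| = 1
```

With these inputs `risk_max` is 1/18 and `risk_min` is 1/21, about 0.0476.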

The discussions in the immediately preceding examples are not
conclusive, but they do suggest that EB estimation with reduced
dimensionality, thus requiring estimation of fewer parameters, may
be better than EB estimation in higher dimensions. In other words, if
data are available in an EB sampling scheme one may do better to use
non-EB estimates of nuisance parameters. Making general qualitative
statements does not seem possible; individual cases may have to
be examined with careful analysis of available data.

4.7 Concomitant variables 

In practical cases where application of EB methods might be 
considered appropriate there will often be concomitant information 
about the parameter values. Specifically, recall the EB sampling 
scheme where we have observations {x 1 ,x 2 ,.. .,x n ) when the para¬ 
meter values are (A t , k 2 ,..., /„). Every x f is usually thought of as an 
estimate of the corresponding A f . Now, it may happen that we also 
have associated with every x f an observation c f on a concomitant 
variable C. Every c ( is not necessarily directly an estimate of X it but C 
and A may not be independent, so that taking account of the observed 
c should improve the estimate of A. However, the emphasis is still on 
estimating individual A values, and not on exploring the relation 
between A and C, as one might do by examining the regression of A on 
C. The paper by Tsutakawa, Shoop and Marienfeld (1985) pays some 
attention to concomitant variables in the EB context. Other examples 
involving concomitant variables are discussed by Raudenbush and 
Bryk (1985) and Fay and Herriot (1979). More details of these 
examples are given in Chapter 8. In the discussion that follows we 
consider just a one-dimensional concomitant variable C; extension to 
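vector $C$ is not difficult.

The details of incorporating concomitant information are relatively
straightforward if the joint distribution of $\Lambda$, $X$ and $C$ is
normal. Retaining our earlier notation, the model is:

$$\Lambda \sim N(\mu_G, \sigma_G^2),$$

$$X \mid \lambda \sim N(\lambda, \sigma^2),$$

$$\Lambda \mid C = c \sim N\!\left( \rho_{\Lambda C}\,\frac{\sigma_G}{\sigma_C}\,(c - \mu_C) + \mu_G,\; \sigma_G^2 (1 - \rho_{\Lambda C}^2) \right).$$

In the formulae above $\mu_C$ is the mean of $C$, $\sigma_C^2$ its
variance, and $\rho_{\Lambda C}$ is the correlation of $\Lambda$ and $C$.
From these specifications the joint distribution of $\Lambda$, $X$ and
$C$ has mean vector $(\mu_G, \mu_G, \mu_C)^T$ and covariance matrix as
follows:

$$\begin{pmatrix}
\sigma_G^2 & \sigma_G^2 & \rho_{\Lambda C}\sigma_G\sigma_C \\
\sigma_G^2 & \sigma^2 + \sigma_G^2 & \rho_{\Lambda C}\sigma_G\sigma_C \\
\rho_{\Lambda C}\sigma_G\sigma_C & \rho_{\Lambda C}\sigma_G\sigma_C & \sigma_C^2
\end{pmatrix}.$$

The Bayes estimate of $\lambda$, i.e. $E(\Lambda \mid x, c)$, is now

$$E(\Lambda \mid x, c) = \mu_G + \frac{1}{D}\,
\bigl(\sigma_G^2,\; \rho_{\Lambda C}\sigma_G\sigma_C\bigr)
\begin{pmatrix}
\sigma_C^2 & -\rho_{\Lambda C}\sigma_G\sigma_C \\
-\rho_{\Lambda C}\sigma_G\sigma_C & \sigma^2 + \sigma_G^2
\end{pmatrix}
\begin{pmatrix} x - \mu_G \\ c - \mu_C \end{pmatrix},$$

where $D = \sigma^2\sigma_C^2 + \sigma_C^2\sigma_G^2(1 - \rho_{\Lambda C}^2)$. Also,

$$\mathrm{var}(\Lambda \mid x, c) = \sigma^2 \sigma_G^2 \sigma_C^2 (1 - \rho_{\Lambda C}^2)/D.$$

When $\rho_{\Lambda C} = 0$ these formulae reduce to the previously
established forms for $E(\Lambda \mid x)$ and $\mathrm{var}(\Lambda \mid x)$.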

From the point of view of EB estimation it is useful to rewrite the
formulae for $E(\Lambda \mid x, c)$ and $\mathrm{var}(\Lambda \mid x, c)$.
First we note that

$$E(X \mid c) = \rho_{\Lambda C}(\sigma_G/\sigma_C)(c - \mu_C) + \mu_G = \gamma_0 + \gamma_1 c,$$

and

$$\mathrm{var}(X \mid c) = \sigma^2 + (1 - \rho_{\Lambda C}^2)\sigma_G^2 = \sigma^2 + \tau^2.$$

Then we have

$$E(\Lambda \mid x, c) = \frac{\tau^2}{\sigma^2 + \tau^2}\, x + \frac{\sigma^2}{\sigma^2 + \tau^2}\, E(X \mid c). \qquad (4.7.1)$$

Both $E(X \mid c)$ and $\mathrm{var}(X \mid c)$ depend on the parameters
of $G$, and when $\Lambda$ and $C$ are independent, equation (4.7.1)
reduces to the usual formula for the Bayes estimate of $\lambda$.
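
As a numerical check on the algebra, the following minimal sketch
evaluates the Bayes estimate once from the joint normal covariance
matrix and once from the shrinkage form (4.7.1). The function names and
the parameter values in the usage lines are illustrative assumptions,
not taken from the text.

```python
import numpy as np

def bayes_estimate_matrix(x, c, mu_g, mu_c, sigma2, sigma_g, sigma_c, rho):
    """E(Lambda | x, c) and var(Lambda | x, c) from the joint normal form."""
    # Covariance matrix of (X, C) and covariances of Lambda with (X, C).
    cov_xc = np.array([[sigma2 + sigma_g**2, rho * sigma_g * sigma_c],
                       [rho * sigma_g * sigma_c, sigma_c**2]])
    cov_l_xc = np.array([sigma_g**2, rho * sigma_g * sigma_c])
    weights = np.linalg.solve(cov_xc, cov_l_xc)
    mean = mu_g + weights @ np.array([x - mu_g, c - mu_c])
    var = sigma_g**2 - weights @ cov_l_xc
    return mean, var

def bayes_estimate_shrinkage(x, c, mu_g, mu_c, sigma2, sigma_g, sigma_c, rho):
    """The same two quantities computed through the form (4.7.1)."""
    gamma1 = rho * sigma_g / sigma_c
    gamma0 = mu_g - gamma1 * mu_c
    tau2 = (1.0 - rho**2) * sigma_g**2
    prior_mean = gamma0 + gamma1 * c          # E(X | c)
    mean = (tau2 * x + sigma2 * prior_mean) / (sigma2 + tau2)
    var = sigma2 * tau2 / (sigma2 + tau2)
    return mean, var

# Illustrative values only.
args = dict(x=1.3, c=0.4, mu_g=0.5, mu_c=-0.2,
            sigma2=2.0, sigma_g=1.5, sigma_c=0.8, rho=0.6)
print(bayes_estimate_matrix(**args))
print(bayes_estimate_shrinkage(**args))
```

The two routes should agree to rounding error, and setting `rho=0`
recovers the usual weighted average of $x$ and $\mu_G$.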

For EB estimation the form of (4.7.1) suggests that one should look
for estimates of $\gamma_0$, $\gamma_1$, $\sigma^2$ and $\tau^2$. We
consider the unequal components case where $x_i$ is replaced by the mean
$\bar{x}_i$, the mean of $m_i$ independent observations at component $i$
with $\mathrm{var}(\bar{x}_i \mid \lambda_i) = \sigma^2/m_i = v_i$. We
shall assume that $\sigma^2$ is known. If it is not, an estimate can be
obtained as the usual within-groups sample variance. Now

$$E(\bar{x}_i \mid c_i) = \gamma_0 + \gamma_1 c_i, \qquad
\mathrm{var}(\bar{x}_i \mid c_i) = v_i + \tau^2,$$

and we write $v_i + \tau^2 = 1/\omega_i$. Conditioning on the observed
$c_i$ values and maximizing the likelihood of the observed $\bar{x}_i$
values we obtain the following equations in $\gamma_0$, $\gamma_1$ and
$\tau^2$ to be solved for their ML estimates:

$$\begin{pmatrix}
\sum \omega_i & \sum \omega_i c_i \\
\sum \omega_i c_i & \sum \omega_i c_i^2
\end{pmatrix}
\begin{pmatrix} \gamma_0 \\ \gamma_1 \end{pmatrix}
=
\begin{pmatrix} \sum \omega_i \bar{x}_i \\ \sum \omega_i c_i \bar{x}_i \end{pmatrix}
\qquad (4.7.2)$$

and

$$\sum \frac{1}{v_i + \tau^2} = \sum \frac{(\bar{x}_i - \gamma_0 - \gamma_1 c_i)^2}{(v_i + \tau^2)^2}. \qquad (4.7.3)$$

In the equations above every $\sum$ should be read as $\sum_{i=1}^{n}$.
Solving them iteratively can be accomplished by starting with a trial
value for $\tau^2$, then solving the two linear equations (4.7.2) for
the trial values of the $\gamma$'s, then checking equality of the left
and right sides of the third equation, (4.7.3), and adjusting the trial
$\tau^2$ appropriately.

Finally we may note that equations like (4.7.2) and (4.7.3) for
estimates of the parameters $\gamma_0$, $\gamma_1$ and $\tau^2$ can be
derived using a weighted least squares approach without appealing to
normality of the underlying distributions.
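
The iterative scheme for (4.7.2) and (4.7.3) can be sketched as
follows. This is one possible implementation under stated assumptions:
the adjustment of the trial $\tau^2$ is done by bisection, the estimate
is set to zero when (4.7.3) has no positive root below the bracket, and
all function names and the simulated data are hypothetical.

```python
import numpy as np

def wls_gammas(xbar, c, v, tau2):
    """Solve the two linear equations (4.7.2) for a trial tau^2."""
    w = 1.0 / (v + tau2)
    a = np.array([[w.sum(), (w * c).sum()],
                  [(w * c).sum(), (w * c * c).sum()]])
    b = np.array([(w * xbar).sum(), (w * c * xbar).sum()])
    return np.linalg.solve(a, b)

def tau2_equation_gap(xbar, c, v, tau2):
    """Left side minus right side of (4.7.3) at the trial gammas."""
    g0, g1 = wls_gammas(xbar, c, v, tau2)
    w = 1.0 / (v + tau2)
    resid = xbar - g0 - g1 * c
    return w.sum() - (w**2 * resid**2).sum()

def fit_concomitant(xbar, c, v, tol=1e-10, max_iter=200):
    """Return (gamma0, gamma1, tau2) satisfying (4.7.2) and (4.7.3)."""
    xbar, c, v = (np.asarray(a, dtype=float) for a in (xbar, c, v))
    lo = 0.0
    if tau2_equation_gap(xbar, c, v, lo) >= 0.0:
        tau2 = 0.0                      # assumption: truncate at zero
    else:
        hi = xbar.var() + v.max() + 1.0
        while tau2_equation_gap(xbar, c, v, hi) < 0.0:
            hi *= 2.0                   # widen until the gap changes sign
        for _ in range(max_iter):
            mid = 0.5 * (lo + hi)
            if tau2_equation_gap(xbar, c, v, mid) < 0.0:
                lo = mid
            else:
                hi = mid
            if hi - lo < tol:
                break
        tau2 = 0.5 * (lo + hi)
    g0, g1 = wls_gammas(xbar, c, v, tau2)
    return g0, g1, tau2

# Simulated illustration: sigma^2 = 1, gamma0 = 2, gamma1 = 0.5, tau^2 = 1.
rng = np.random.default_rng(1)
n = 400
c = rng.normal(size=n)
m = rng.integers(1, 6, size=n)          # group sizes 1..5
v = 1.0 / m
lam = 2.0 + 0.5 * c + rng.normal(size=n)
xbar = lam + rng.normal(size=n) * np.sqrt(v)
g0, g1, tau2 = fit_concomitant(xbar, c, v)
print(g0, g1, tau2)
```

Given the fitted values, the EB estimate for component $i$ follows by
substituting them, together with $v_i$ in place of $\sigma^2$, in
(4.7.1).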



CHAPTER 5 


Testing of hypotheses 


5.1 Introduction 

In the non-Bayes approach to hypothesis testing a sharp distinction is 
usually made between the null hypothesis and alternative hypotheses. 
The null hypothesis holds a special place, a notion which is reinforced 
by the terminology ‘testing of hypotheses’. The Bayes approach, or at 
least the Bayes decision theoretic approach, is different, emphasis 
being on the choice between hypotheses rather than one being singled 
out for special attention. There seems to be no natural Bayesian 
counterpart to the testing of a null hypothesis against a vague 
alternative which is just its negation. For a discussion of this and 
related matters see Cox and Hinkley (1974, p. 392). 

From the point of view of decision theory the essential difference 
between point estimation and choosing between hypotheses is in the 
form of the loss function. Commonly a '0-1' loss function is used in
the latter context, i.e. the loss is taken as 0 if the correct choice is made, 
otherwise it is 1. Other loss structures are of course also used, and in 
section 5.4 we shall discuss one which has found favour in EB theory. 

The simplest types of hypothesis testing problems have to do with 
choice between k simple hypotheses. They can also be regarded as 
problems of point estimation where the prior distribution actually is 
discrete, having atoms of probability at $\lambda_1, \lambda_2, \ldots, \lambda_k$, say. However,
here the loss is 0-1 and we do not take the mean of the posterior 
distribution as the point estimate. 

Whatever loss structure is considered, the EB approach to 
hypothesis testing is in principle the same as to point estimation. 
From past data an estimate of the prior distribution is made, and it 
replaces the actual prior distribution in the Bayes decision rule, thus 
producing an EB decision rule. With a special loss structure such as 
that discussed briefly in section 1.5 it is possible to avoid the process 
of finding an explicit estimate of the prior distribution G. In other
words, a 'simple' EB approach to hypothesis testing is possible as in
point estimation. We shall take this up again in section 5.4, but in 
other parts of this chapter we shall use 0-1 loss unless another 
structure is specified. 

5.2 Two simple hypotheses, one-parameter problems 


5.2.1 Single past and current observations: $m_i = m = 1$

The random variable $X$ has distribution function $F(x \mid \lambda)$
and $\Lambda$ is known to have one of two given values $\lambda_1$ or
$\lambda_2$; $\lambda_1 < \lambda_2$. The prior probabilities of
$\lambda_1$ and $\lambda_2$ are $\theta_1$ and $\theta_2 = 1 - \theta_1$.
With the 0-1 loss function the Bayes rule is to choose $\lambda_1$ when
the observation $x$ is such that
$\theta_2 f(x \mid \lambda_2) < \theta_1 f(x \mid \lambda_1)$, where
$f(x \mid \lambda)$ is the p.d.f. of $X$. Modifications for discrete $X$
are obvious. If the ratio $f(x \mid \lambda_1)/f(x \mid \lambda_2)$ is
monotonic in $x$, the Bayes rule is: choose $\lambda_1$ if $x < \xi_G$,
where $x = \xi_G$ is the solution of

$$\theta_2 f(x \mid \lambda_2) = \theta_1 f(x \mid \lambda_1); \qquad (5.2.1)$$

see also section 1.4.

In this case, estimation of the prior $G$ in order to derive an EB rule
reduces to estimating $\theta_1$. Letting $x_1, x_2, \ldots, x_n$ be the
past observations, and noting that the marginal p.d.f. of $X$ is
$\theta_1 f(x \mid \lambda_1) + \theta_2 f(x \mid \lambda_2)$, one can
estimate $\theta_1$ by maximizing the log-likelihood

$$L_n = \sum_{i=1}^{n} \ln\bigl[\theta_1 f(x_i \mid \lambda_1) + (1 - \theta_1) f(x_i \mid \lambda_2)\bigr]$$

w.r.t. $\theta_1$, subject to $0 \le \theta_1 \le 1$. Thus the estimate
of $\theta_1$ can be obtained as the solution of

$$\sum_{i=1}^{n} \frac{f(x_i \mid \lambda_1) - f(x_i \mid \lambda_2)}{\theta_1 f(x_i \mid \lambda_1) + (1 - \theta_1) f(x_i \mid \lambda_2)} = 0 \qquad (5.2.2)$$

if the solution lies between 0 and 1; otherwise it is taken as 0 or 1,
whichever gives the greater $L_n$.
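
A minimal sketch of solving (5.2.2) numerically is given below. It
relies on the fact that the left side of (5.2.2) is non-increasing in
$\theta_1$ (the log-likelihood is concave), so bisection applies and the
sign of the score at 0 and 1 identifies the boundary cases. The
$N(\lambda, 1)$ density, the function names and the data are
illustrative assumptions.

```python
import math

def normal_pdf(x, mean):
    """Density of N(mean, 1); any other f(x | lambda) could be passed instead."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def score(theta1, xs, lam1, lam2, pdf=normal_pdf):
    """Left side of (5.2.2)."""
    total = 0.0
    for x in xs:
        f1, f2 = pdf(x, lam1), pdf(x, lam2)
        total += (f1 - f2) / (theta1 * f1 + (1.0 - theta1) * f2)
    return total

def mle_theta1(xs, lam1, lam2, pdf=normal_pdf, tol=1e-10):
    """Maximum likelihood estimate of theta_1, truncated to [0, 1]."""
    if score(0.0, xs, lam1, lam2, pdf) <= 0.0:
        return 0.0          # likelihood decreasing on [0, 1]
    if score(1.0, xs, lam1, lam2, pdf) >= 0.0:
        return 1.0          # likelihood increasing on [0, 1]
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, xs, lam1, lam2, pdf) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical past observations with lambda_1 = -1, lambda_2 = +1.
xs = [-1.2, -0.8, -1.5, 0.9, -0.3, -1.1, 1.3, -0.6]
theta1_hat = mle_theta1(xs, -1.0, 1.0)
print(theta1_hat)
```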

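Finding the solution of (5.2.2) is usually rather awkward, and
simpler alternative methods of estimating $\theta_1$ have been used. For
example, suppose that $E(X \mid \lambda) = \lambda$. Then the mean of the
mixed distribution $F_G(x)$ is

$$\int x\, dF_G(x) = \iint x\, dF(x \mid \lambda)\, dG(\lambda) = \lambda_1\theta_1 + \lambda_2\theta_2. \qquad (5.2.3)$$

Thus if $\bar{x}$ is the mean of the past observations, (5.2.3) suggests
estimation of $\theta_1$ by $\hat\theta_1$ from

$$\bar{x} = \hat\theta_1\lambda_1 + (1 - \hat\theta_1)\lambda_2,$$

giving

$$\hat\theta_1 = \begin{cases}
(\lambda_2 - \bar{x})/(\lambda_2 - \lambda_1), & \text{for } \lambda_1 < \bar{x} < \lambda_2 \\
0, & \text{for } \bar{x} \ge \lambda_2 \\
1, & \text{for } \bar{x} \le \lambda_1.
\end{cases} \qquad (5.2.4)$$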

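(a) An example: the case of normal F

Suppose that the distribution of $X$ for given $\lambda$ is
$N(\lambda, 1)$ as in Example 1.4.1. Then $\xi_G$ is given by (1.4.2),
and using the estimate of $\theta_1$ given by (5.2.4), the EB rule is:
choose $\lambda_1$ or $\lambda_2$ according as $x < \hat\xi_G$ or
$x > \hat\xi_G$, where

$$\hat\xi_G = \begin{cases}
\dfrac{\lambda_1 + \lambda_2}{2} + \dfrac{1}{\lambda_2 - \lambda_1}
\ln\!\left(\dfrac{\lambda_2 - \bar{x}}{\bar{x} - \lambda_1}\right), & \lambda_1 < \bar{x} < \lambda_2 \\
-\infty, & \bar{x} \ge \lambda_2 \\
+\infty, & \bar{x} \le \lambda_1.
\end{cases} \qquad (5.2.5)$$

The first line of (5.2.5) is the solution of (5.2.1) for the
$N(\lambda, 1)$ density with $\theta_1/\theta_2$ replaced by
$\hat\theta_1/(1 - \hat\theta_1) = (\lambda_2 - \bar{x})/(\bar{x} - \lambda_1)$.
The expected loss incurred by using the rule (5.2.5) is $W(\hat\xi_G)$
given by

$$W(\hat\xi_G) = \theta_1\{1 - \Phi(\hat\xi_G - \lambda_1)\} + \theta_2\Phi(\hat\xi_G - \lambda_2), \qquad (5.2.6)$$

where $\Phi(u)$ is the standard normal distribution function.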
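
The moment estimate (5.2.4), the cut-off obtained by solving (5.2.1)
for the $N(\lambda, 1)$ density, and the expected loss (5.2.6) are
simple enough to sketch directly. The function names are our own; the
cut-off formula is the standard solution of (5.2.1) for unit-variance
normal densities.

```python
import math

def phi_cdf(u):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def theta1_moment(xbar, lam1, lam2):
    """Truncated moment estimate (5.2.4)."""
    if xbar >= lam2:
        return 0.0
    if xbar <= lam1:
        return 1.0
    return (lam2 - xbar) / (lam2 - lam1)

def cutoff(theta1, lam1, lam2):
    """Solution of (5.2.1) for N(lambda, 1); infinite when theta1 is 0 or 1."""
    if theta1 <= 0.0:
        return -math.inf
    if theta1 >= 1.0:
        return math.inf
    return 0.5 * (lam1 + lam2) + math.log(theta1 / (1.0 - theta1)) / (lam2 - lam1)

def expected_loss(xi, theta1, lam1, lam2):
    """Expected 0-1 loss (5.2.6) of the rule 'choose lam1 if x < xi'."""
    return theta1 * (1.0 - phi_cdf(xi - lam1)) + (1.0 - theta1) * phi_cdf(xi - lam2)

# lambda_1 = -1, lambda_2 = +1, mean of past observations -0.63.
theta1_hat = theta1_moment(-0.63, -1.0, 1.0)
xi_hat = cutoff(theta1_hat, -1.0, 1.0)
print(theta1_hat, xi_hat, expected_loss(xi_hat, theta1_hat, -1.0, 1.0))
```

With `theta1 = 0.8` the Bayes cut-off gives an expected loss of about
0.112, while the cut-off 0 of the symmetric non-Bayes rule gives about
0.159.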

In order to evaluate $E_n W(\hat\xi_G)$ we need the distribution of
$\hat\xi_G$. Now, the distribution of $\bar{x}$ will be well approximated
by a normal distribution for quite small values of $n$, because the
marginal $X$-distribution is itself reasonably close to normal when
$\lambda_2 - \lambda_1$ is not large by comparison with
$\mathrm{var}(X \mid \lambda)$, which equals 1 in this case. For example,
with $\lambda_2 = +1$, $\lambda_1 = -1$, $\theta_1 = 0.8$, the
coefficients of skewness and kurtosis of the marginal $X$-distribution
are 0.069 and 3.038.
The values of $E_n W(\hat\xi_G)$ in Table 5.1 were calculated using a
normal approximation for the distribution of $\bar{x}$, the expression
(5.2.5) of $\hat\xi_G$ in terms of $\bar{x}$ and numerical integration.
Various values of $\theta_1$ and of the difference between $\lambda_1$
and $\lambda_2$ were used. Two factors influenced these choices. When
$\theta_1 = \theta_2$ the Bayes and 'best' non-Bayes rules coincide, so
that the EB rule must necessarily give a worse result than $T$. As
$\theta_1 \to 0$ or 1 the Bayes rule becomes relatively more effective
than $T$. Also, for fixed $\theta_1$, the Bayes rule tends to be
relatively less effective as $\lambda_2 - \lambda_1 \to \infty$. When
$\theta_1$ is not close to 0.5 considerable advantage may be gained by
the EB approach. This emphasizes the need, when contemplating using an
EB method, of having some idea as to the spread of the prior
distribution; in the present case the need is for preliminary
information on $\theta_1$.

Table 5.1 Two simple hypotheses, $H_1: \lambda = \lambda_1$,
$H_2: \lambda = \lambda_2$, with prior probabilities $\theta_1$ and
$1 - \theta_1$

                                   $E_n W(\hat\xi_G)$
$\theta_1$   $W(T)$   $W(\xi_G)$   $n = 10$   $n = ?$   $n = 100$

(i) $\lambda_1 = -1$, $\lambda_2 = +1$
0.5          0.159    0.159        0.195      0.164     0.162
0.6          0.159    0.154        0.180      0.160     0.157
0.7          0.159    0.139        0.173      0.145     0.142
0.8          0.159    0.112        0.142      0.120     0.116
0.9          0.159    0.069        0.091      0.078     0.075

(ii) $\lambda_1 = -2$, $\lambda_2 = +2$
0.5          0.023    0.023        0.026      0.023     0.023
0.6          0.023    0.022        0.029      0.023     0.022
0.7          0.023    0.020        0.032      0.021     0.021
0.8          0.023    0.017        0.035      0.019     0.018
0.9          0.023    0.012        0.032      0.016     0.013
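
The calculation behind entries of this kind can be sketched as follows:
$\bar{x}$ is treated as normal with the mean and variance of the
marginal distribution, the EB cut-off is computed from $\bar{x}$ as in
(5.2.5), and the loss (5.2.6) is integrated numerically. The trapezoidal
rule, the grid size and the function names are our assumptions, not
details given in the text.

```python
import math

def phi_cdf(u):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def eb_expected_loss(n, theta1, lam1, lam2, grid=4001, width=8.0):
    """E_n W of the EB rule under a normal approximation for the mean."""
    mean = theta1 * lam1 + (1.0 - theta1) * lam2
    sd = math.sqrt((1.0 + (lam2 - lam1) ** 2 * theta1 * (1.0 - theta1)) / n)
    lo, hi = mean - width * sd, mean + width * sd
    h = (hi - lo) / (grid - 1)
    total = 0.0
    for k in range(grid):
        xbar = lo + k * h
        # EB cut-off (5.2.5) as a function of the past mean.
        if xbar >= lam2:
            xi = -math.inf
        elif xbar <= lam1:
            xi = math.inf
        else:
            xi = 0.5 * (lam1 + lam2) + math.log((lam2 - xbar) / (xbar - lam1)) / (lam2 - lam1)
        # Loss (5.2.6) at this cut-off, evaluated at the true theta1.
        loss = theta1 * (1.0 - phi_cdf(xi - lam1)) + (1.0 - theta1) * phi_cdf(xi - lam2)
        dens = math.exp(-0.5 * ((xbar - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        weight = 0.5 if k in (0, grid - 1) else 1.0   # trapezoidal end points
        total += weight * loss * dens * h
    return total

print(eb_expected_loss(100, 0.8, -1.0, 1.0))
print(eb_expected_loss(100, 0.5, -1.0, 1.0))
```

The printed values can be compared with the $n = 100$ column of
Table 5.1(i); small differences in the third decimal place are to be
expected from the details of the numerical integration.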

When a situation of 'least favourable' prior distribution is
encountered, for example $\theta_1 = 0.5$, the EB method is less 'good'
than the best conventional method. On the other hand, the results of
Table 5.1 indicate that the possible gain in using the EB method for
favourable values of $\theta_1$ is much greater than the loss in using
it when $\theta_1$ is unsuitable. For example for $n = 100$,
$\theta_1 = 0.5$, $E_n W(\hat\xi_G) - W(T) \approx 0.162 - 0.1587 = 0.0033$,
and the '% loss' is $100(0.0033/0.1587) = 2.1\%$. But when
$\theta_1 = 0.9$, $W(T) - E_n W(\hat\xi_G) \approx 0.1587 - 0.075 = 0.0837$,
and the '% gain' is $100(0.0837/0.1587) = 52.7\%$. These results
make a strong case for the use of EB methods in the present context.

The preceding discussion raises the question of judging the
effectiveness in practice of the EB rule relative to non-Bayes
procedures, i.e. when a set of past observations is given. Here,
estimates of $E_n W(\hat\xi_G)$, $W(\hat\xi_G)$ and $W(\text{Bayes})$ are
needed. All of these quantities are functions of $\theta_1$ for given
$n$, $\lambda_1$, $\lambda_2$. Thus, if $\hat\theta_1$ is an estimate of
$\theta_1$, the corresponding estimates of $W(\text{Bayes})$ and
$E_n W(\hat\xi_G)$ are obtained by replacing $\theta_1$ by $\hat\theta_1$
in the appropriate calculation. If $\lambda_1 = -1$, $\lambda_2 = 1$ the
tabulated values of the two functions given in Table 5.1(i) can be used,
with interpolation, if necessary.

As an example, suppose that $n = 100$, $\lambda_1 = -1$,
$\lambda_2 = +1$, $\bar{x} = -0.63$. Then
$\hat\theta_1 = 1.63/2 = 0.815$, and the estimates of $W(\xi_G)$ and
$E_n W(\hat\xi_G)$ are respectively 0.107 and 0.111.

Confidence limits for $W(\xi_G)$ and $E_n W(\hat\xi_G)$ can be obtained
by first getting confidence limits for $\theta_1$. Approximate normality
of the distribution of $\bar{x}$ and the formulae

$$E(\bar{x}) = \lambda_2 + (\lambda_1 - \lambda_2)\theta_1,$$
$$\mathrm{var}(\bar{x}) = \{1 + (\lambda_2 - \lambda_1)^2\theta_1(1 - \theta_1)\}/n$$

are used in straightforward calculations to find the confidence limits
for $\theta_1$. Then interpolation in Table 5.1 can again be done to find
the other confidence limits.
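
One simple way of carrying out this calculation is sketched below. It
assumes that $\mathrm{var}(\bar{x})$ is evaluated at the point estimate
of $\theta_1$ (a plug-in choice of ours; the text does not say how the
variance is handled) and uses the normal quantile 1.2816 for two-sided
80% limits. The function name is hypothetical.

```python
import math

def theta1_confidence_limits(xbar, n, lam1, lam2, z):
    """Approximate limits for E(xbar) and theta_1 from a normal approximation."""
    theta_hat = min(1.0, max(0.0, (lam2 - xbar) / (lam2 - lam1)))
    var = (1.0 + (lam2 - lam1) ** 2 * theta_hat * (1.0 - theta_hat)) / n
    half = z * math.sqrt(var)
    mean_lo, mean_hi = xbar - half, xbar + half
    # E(xbar) decreases as theta_1 increases, so the limits swap over.
    theta_lo = min(1.0, max(0.0, (lam2 - mean_hi) / (lam2 - lam1)))
    theta_hi = min(1.0, max(0.0, (lam2 - mean_lo) / (lam2 - lam1)))
    return (mean_lo, mean_hi), (theta_lo, theta_hi)

# n = 100, lambda_1 = -1, lambda_2 = +1, xbar = -0.63, two-sided 80% limits.
mean_limits, theta_limits = theta1_confidence_limits(-0.63, 100, -1.0, 1.0, 1.2816)
print(mean_limits, theta_limits)
```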

For the example above, two-sided 80% confidence limits for the
parameters listed below are

$E(\bar{x})$:          -0.468,  -0.792
$\theta_1$:            0.734,   0.896
$W(\xi_G)$:            0.072,   0.131
$E_n W(\hat\xi_G)$:    0.076,   0.133

Confidence limits for the particular $W(\hat\xi_G)$ can be found more
easily by substituting $\hat\xi_G$ and the two limit values for
$\theta_1$ in (5.2.6). In the example above the results are 0.078,
0.138. The similarity of the limits for $W(\xi_G)$, $E_n W(\hat\xi_G)$,
$W(\hat\xi_G)$ in this example is accidental. In general it is possible
for the upper limit for $W(\hat\xi_G)$ to be substantially greater than
$W(T)$, but by its definition the upper limit for $W(\xi_G)$ cannot
exceed $W(T)$.

Finally, by following steps like those above, we can find a point
estimate of and confidence limits for the difference
$E_n W(\hat\xi_G) - W(\xi_G)$. In our example the results are: point
estimate = 0.004; two-sided 80% confidence limits 0.003, 0.006.


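5.2.2 $m$, $m_i \ge 1$ not necessarily equal


If $m$ current observations $\mathbf{x}^T = (x_1, x_2, \ldots, x_m)$ are
made independently on $X$, $\lambda_1$ being chosen if
$\mathbf{x} \in A_1$, $\lambda_2$ if $\mathbf{x} \in A_2$, the expected
loss is

$$\theta_1 \int_{\mathbf{x} \in A_2} f(\mathbf{x} \mid \lambda_1)\, d\mathbf{x}
+ \theta_2 \int_{\mathbf{x} \in A_1} f(\mathbf{x} \mid \lambda_2)\, d\mathbf{x},$$

where $f(\mathbf{x} \mid \lambda_j) = \prod_{i=1}^{m} f(x_i \mid \lambda_j)$,
$j = 1, 2$. The Bayes rule $\delta_G(\mathbf{x})$ chooses $\lambda_1$ if
$\theta_2 f(\mathbf{x} \mid \lambda_2) < \theta_1 f(\mathbf{x} \mid \lambda_1)$,
and $\lambda_2$ otherwise. If a one-dimensional sufficient statistic,
$t$, exists this rule reduces to choice of $\lambda_1$ if
$\theta_2 f(t \mid \lambda_2, m) < \theta_1 f(t \mid \lambda_1, m)$ in an
obvious notation.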

Estimation of $\theta_1$ and $\theta_2$ remains a problem. Suppose that
the observations at the $i$th past component are $x_{ij}$,
$j = 1, 2, \ldots, m_i$. Then the likelihood of the entire collection of
past observations can be expressed as

$$\prod_{i=1}^{n} \left\{ \theta_1 \prod_{j=1}^{m_i} f(x_{ij} \mid \lambda_1)
+ \theta_2 \prod_{j=1}^{m_i} f(x_{ij} \mid \lambda_2) \right\},$$

which simplifies to

$$C(\mathbf{x}) \prod_{i=1}^{n} \{\theta_1 f(t_i \mid \lambda_1, m_i) + \theta_2 f(t_i \mid \lambda_2, m_i)\}$$

when sufficiency of $t$ for $\lambda$ holds. This leads to an equation
similar to (5.2.2) for finding the MLE of $\theta_1$.

Again, it may be simpler to use the method of moments. For example, if
$E(X \mid \lambda) = \lambda$, suppose that $\bar{x}_i$ is the mean of
the observations at the $i$th component and let
$\bar{x}_{.} = \sum_{i=1}^{n} \bar{x}_i / n$. Then we can take as an
estimate of $\theta_1$ the solution of

$$\bar{x}_{.} = \hat\theta_1\lambda_1 + (1 - \hat\theta_1)\lambda_2, \qquad (5.2.7)$$

truncated at 0 and 1, as appropriate; see also (5.2.4). Refer also to
Example 2.5.1 which gives an illustration of the implementation of
this method.

If no sufficient statistic exists, simplification through the use of an 
estimate of A can be considered, in much the same way as was done in 
point estimation. The idea is to base the decision about A on the 
observed value of an estimate of A. A difficulty here is that the 
distribution of an estimator, for example the MLE, is usually not 
known exactly. If m is sufficiently large it will often be possible to 
approximate the distribution of the estimate by a normal distribution, 
leading to relatively straightforward calculations. 

5.2.3 Bayes cut-off rules 

Suppose that $\hat\lambda$ is an estimate of $\lambda$ and that we
consider rules of the type:

choose $\lambda_1$ if $\hat\lambda \le \xi$ and $\lambda_2$ if $\hat\lambda > \xi$.

Such a rule may be referred to as a cut-off rule, $\xi$ being the
cut-off point. The expected 0-1 loss under this rule is

$$W(\xi) = \theta_1\{1 - \Psi(\xi \mid \lambda_1, m)\} + \theta_2 \Psi(\xi \mid \lambda_2, m),$$

where $\Psi(\hat\lambda \mid \lambda, m)$ is the distribution function of
$\hat\lambda$. The Bayes cut-off value is defined as the value of $\xi$
which minimizes the expected loss. Under suitable conditions it can be
obtained by differentiation of $W(\xi)$ w.r.t. $\xi$, so that it is the
solution of

$$-\theta_1 \Psi'(\xi \mid \lambda_1, m) + \theta_2 \Psi'(\xi \mid \lambda_2, m) = 0, \qquad (5.2.8)$$

the prime denoting differentiation w.r.t. $\xi$. In certain special cases
the Bayes cut-off rule is the actual Bayes rule, but in general it is
sub-optimal.

If $\hat\theta_1$ is an estimate of $\theta_1$ derived from past
observations in an EB sampling scheme, substitution in (5.2.8) leads to
an estimate $\hat\xi_G$ of the Bayes cut-off $\xi_G$. We now have,
approximately,

$$E_n W(\hat\xi_G) = W(\xi_G) + (1/2)\,\mathrm{var}(\hat\xi_G)\, W''(\xi_G). \qquad (5.2.9)$$

Note that $W''(\xi_G) > 0$. Following the methods given in section 3.4
an estimate of $\mathrm{var}(\hat\xi_G)$ can be calculated. This enables
us to make an assessment of the relative goodness of the Bayes cut-off
rule in practice.

5.2.4 Nuisance parameters 

Suppose that the distribution of $X$, $F(x \mid \lambda, \omega)$,
depends on the parameter of primary interest, $\lambda$, and an unknown
nuisance parameter, $\omega$. A typical example is the
$N(\lambda, \omega)$ distribution, where the variance $\omega$ is the
nuisance parameter. Two cases are worth distinguishing:

1. the parameter $\omega$ is fixed, i.e. its marginal prior distribution
is degenerate at $\omega$,

2. $\omega$ has a non-degenerate prior distribution.

In case 1 an estimate of $\omega$ will usually be obtainable if $m$ or
some of the $m_i$ values are greater than one. Then the EB rule is
constructed as before with $\omega$ replaced by its estimate. Case 2 can
be complicated, and the simplest approach seems to be through assumption
of a parametric form of prior distribution for $\omega$. Further
simplification can be effected by using the non-optimal strategy of
restricting decision rules to be cut-off rules based explicitly only on
the natural estimate of $\lambda$. As usual, all calculations are much
simplified if low-dimensional sufficient statistics exist. We give two
examples illustrating this discussion.


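Example 5.2.1 Let $X \sim N(\lambda, \omega)$,
$P(\Lambda = \lambda_j) = \theta_j$, $j = 1, 2$, $P(\Omega = \omega) = 1$,
i.e. we have case 1 above. The natural estimate of $\omega$ is the
within-groups mean square

$$\hat\omega = \frac{1}{M - n} \sum_{i=1}^{n} \sum_{j=1}^{m_i} (x_{ij} - \bar{x}_{i.})^2,$$

where $x_{ij}$ is the $j$th observation at the $i$th past component,
$\bar{x}_{i.}$ is the mean of the $i$th group of observations and
$M = m_1 + m_2 + \cdots + m_n$. Estimating $\theta_1$ according to
(5.2.7) the EB rule is

choose $\lambda_1$ if $\bar{x} < \hat\xi_G$, $\lambda_2$ if $\bar{x} > \hat\xi_G$,

where $\bar{x}$ is the mean of the current $m$ observations, and

$$\hat\xi_G = \begin{cases}
\dfrac{\lambda_1 + \lambda_2}{2} + \dfrac{\hat\omega/m}{\lambda_2 - \lambda_1}
\ln\!\left(\dfrac{\lambda_2 - \bar{x}_{.}}{\bar{x}_{.} - \lambda_1}\right), & \lambda_1 < \bar{x}_{.} < \lambda_2 \\
-\infty, & \bar{x}_{.} \ge \lambda_2 \\
+\infty, & \bar{x}_{.} \le \lambda_1.
\end{cases} \qquad (5.2.10)$$

In order to compare the performance of this rule with that of the rule
given by (5.2.5) consider the following special case: the $m_i$ are
independent realizations of $M \sim U(1, 9)$, $m = 5$,
$\lambda_1 = -1/\sqrt{5}$, $\lambda_2 = +1/\sqrt{5}$, $\omega = 1$.
Here we have

$$E(\Lambda) = (\theta_2 - \theta_1)/\sqrt{5}, \qquad
\mathrm{var}(\Lambda) = \{1 - (\theta_2 - \theta_1)^2\}/5,$$

$$E(\bar{x}_{.}) = (-\theta_1 + \theta_2)/\sqrt{5}, \qquad
\mathrm{var}(\bar{x}_{.}) = \{\omega E(1/M) + \mathrm{var}(\Lambda)\}/n.$$

Taking $n = 100$, $\theta_1 = 0.8$, $\theta_2 = 0.2$ and using the
approximation (5.2.9) we have $\xi_G = 0.6931/\sqrt{5}$,
$W(\xi_G) = 0.112$, and

$$W''(\xi) = 5\{\theta_1(\xi\sqrt{5} + 1)\phi(\xi\sqrt{5} + 1) - \theta_2(\xi\sqrt{5} - 1)\phi(\xi\sqrt{5} - 1)\} = 5(0.15225),$$

where $\phi$ is the standard normal density. If we take $\omega = 1$ as
known we should get a result close to the entry at $n = 100$,
$\theta_1 = 0.8$ for $E_n W(\hat\xi_G)$ in Table 5.1. In fact (5.2.9)
gives

$$E_n W(\hat\xi_G) \approx 0.0041 + W(\xi_G) = 0.1163$$

and the table entry is 0.116.

Taking account of the variability in the estimate of $\omega$ gives only
a slight increase in the expected loss; $E_n W(\hat\xi_G) \approx 0.1168$.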
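
The arithmetic of this example, with $\omega = 1$ taken as known, can
be retraced with the short sketch below. Two points are our
assumptions rather than statements of the text: $U(1, 9)$ is read as the
discrete uniform distribution on the integers 1 to 9, and
$\mathrm{var}(\hat\xi_G)$ is obtained by a first-order (delta method)
expansion of $\hat\xi_G$ about $E(\bar{x}_{.})$. The function name is
hypothetical.

```python
import math

def phi_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def phi_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def example_521(n=100, m=5, omega=1.0, theta1=0.8):
    theta2 = 1.0 - theta1
    lam1, lam2 = -1.0 / math.sqrt(5.0), 1.0 / math.sqrt(5.0)
    scale = math.sqrt(m / omega)        # current mean has variance omega / m

    # Bayes cut-off, its expected loss, and the second derivative of W.
    xi_g = 0.5 * (lam1 + lam2) + (omega / m) * math.log(theta1 / theta2) / (lam2 - lam1)
    u1, u2 = (xi_g - lam1) * scale, (xi_g - lam2) * scale
    w_bayes = theta1 * (1.0 - phi_cdf(u1)) + theta2 * phi_cdf(u2)
    w2 = (m / omega) * (theta1 * u1 * phi_pdf(u1) - theta2 * u2 * phi_pdf(u2))

    # Variance of the mean of the past component means.
    e_inv_m = sum(1.0 / k for k in range(1, 10)) / 9.0   # assumed discrete U(1, 9)
    mean_lam = theta1 * lam1 + theta2 * lam2
    var_lam = theta1 * lam1 ** 2 + theta2 * lam2 ** 2 - mean_lam ** 2
    var_xbar = (omega * e_inv_m + var_lam) / n

    # Delta method: derivative of the cut-off (5.2.10) w.r.t. the past mean.
    dxi = -(omega / m) / ((lam2 - mean_lam) * (mean_lam - lam1))
    var_xi = dxi ** 2 * var_xbar

    approx = w_bayes + 0.5 * var_xi * w2                 # (5.2.9)
    return {"xi_g": xi_g, "w_bayes": w_bayes, "w2": w2, "approx": approx}

result = example_521()
print(result)
```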


Example 5.2.2 Suppose that $X \sim N(\lambda, \omega)$ and that the
conditional p.d.f. of $\Omega \mid \Lambda = \lambda_j$ is

$$h(\omega \mid \lambda_j) = \{(A_j/2)^{\nu/2}/\Gamma(\nu/2)\}\, \omega^{-(1 + \nu/2)} \exp\{-A_j/(2\omega)\}, \quad j = 1, 2.$$

Let $\bar{x}$ be the mean of the $m$ current observations and
$Q = \sum_{i=1}^{m} (x_i - \bar{x})^2$; these are sufficient statistics
for $\lambda$ and $\omega$. Suppose that a cut-off rule is adopted, i.e.
choose $\lambda_1$ if $\bar{x} < \xi$, $\lambda_2$ otherwise. Then the
expected loss is

$$W(\xi) = \theta_1 \iint_{\bar{x} > \xi} f(\bar{x} \mid \lambda_1, \omega)\, h(\omega \mid \lambda_1)\, d\bar{x}\, d\omega
+ \theta_2 \iint_{\bar{x} < \xi} f(\bar{x} \mid \lambda_2, \omega)\, h(\omega \mid \lambda_2)\, d\bar{x}\, d\omega,$$

where

$$f(\bar{x} \mid \lambda_j, \omega) = \omega^{-m/2} \exp[-\{m(\bar{x} - \lambda_j)^2 + Q\}/(2\omega)], \quad j = 1, 2.$$

Performing the appropriate integrations, the Bayes cut-off is seen to
be the solution of the following equation in $\xi$:

$$\frac{\theta_1 A_1^{\nu/2}}{\{m(\xi - \lambda_1)^2 + A_1\}^{(m+\nu)/2 - 1}}
= \frac{\theta_2 A_2^{\nu/2}}{\{m(\xi - \lambda_2)^2 + A_2\}^{(m+\nu)/2 - 1}},$$

which can be reduced to a quadratic in $\xi$.


In order to construct an empirical version of this rule, estimates of
$\theta_1$, $\theta_2 = 1 - \theta_1$, $\nu$, $A_1$, $A_2$ are needed,
and can be obtained by the method of maximum likelihood, or the method
of moments, or otherwise. Using the method of moments, let $\bar{x}_i$,
$\hat\omega_i$ be the estimates of $\lambda$ and $\omega$ at component
$i$. Then, using the symbol $\to$ to mean 'is an estimate of', we have:

1. $\bar{x}_1 + \cdots + \bar{x}_n \to n\theta_1\lambda_1 + n\theta_2\lambda_2$

2. $\bar{x}_1^2 + \cdots + \bar{x}_n^2 \to n(\theta_1\lambda_1^2 + \theta_2\lambda_2^2) + \{(\theta_1 A_1 + \theta_2 A_2)/(\nu - 2)\}(1/m_1 + \cdots + 1/m_n)$

3. $\bar{x}_1\hat\omega_1 + \cdots + \bar{x}_n\hat\omega_n \to n(\lambda_1 A_1\theta_1 + \lambda_2 A_2\theta_2)/(\nu - 2)$

4. $\hat\omega_1 + \cdots + \hat\omega_n \to n(\theta_1 A_1 + \theta_2 A_2)/(\nu - 2)$.

The reason for writing $\to$ instead of $=$ is that solutions of the
equations thus obtained may not exist. We shall refer to relations 1,
etc., as equation 1, etc. when $\to$ is replaced by $=$.

Suppose that a non-trivial solution to equation 1 exists, i.e. a
solution $\hat\theta_1$ such that $0 < \hat\theta_1 < 1$. Then taking a
trial value $\nu^{(0)}$ of $\nu$, the linear equations 2 and 3 can be
solved for $A_1^{(0)}$ and $A_2^{(0)}$. If both of these are greater
than 0 equation 4 can be used to calculate a new trial $\nu$, this
process being repeated until all equations are satisfied, or until the
magnitude of the difference between the left and right sides of
equation 4 is minimized.

5.3 $k \ge 3$ simple hypotheses

5.3.1 $m_i = m = 1$

Let $\lambda_1 < \lambda_2 < \cdots < \lambda_k$ be given possible values
of the single parameter $\lambda$, and
$\theta_1, \theta_2, \ldots, \theta_k$ the prior probabilities of these
parameter values. So, the prior distribution function of $\Lambda$ has
steps of height $\theta_j$ at $\lambda_j$, $j = 1, 2, \ldots, k$. The
Bayes rule is derived by essentially the same reasoning as applied in
the case $k = 2$. It states that $\lambda_1$ or $\lambda_2, \ldots,$ or
$\lambda_k$ is selected according as $\theta_1 f(x \mid \lambda_1)$,
$\theta_2 f(x \mid \lambda_2), \ldots, \theta_k f(x \mid \lambda_k)$ is
greatest when $x$ is observed.

As far as EB implementation is concerned we have here the problem of
estimating the finite mixing distribution described above, i.e. the
prior distribution of $\Lambda$. Methods of estimating
$\theta_1, \theta_2, \ldots, \theta_k$ have been discussed in Chapter 2.
We shall consider just one example in some detail.


Example 5.3.1 Assume that the data distribution of $X$ for given
$\lambda$ is $N(\lambda, 1)$, and that $k = 3$. In the EB setting we
have to estimate $\theta_1$, $\theta_2$, and
$\theta_3 = 1 - \theta_1 - \theta_2$. Using the method of moments we
note that

$$\begin{aligned}
E(X_G) &= \lambda_1\theta_1 + \lambda_2\theta_2 + \lambda_3(1 - \theta_1 - \theta_2) \\
E(X_G^2) &= 1 + \lambda_1^2\theta_1 + \lambda_2^2\theta_2 + \lambda_3^2(1 - \theta_1 - \theta_2).
\end{aligned} \qquad (5.3.1)$$

Denoting the sample first and second moments calculated from past
observations by $m'_1$ and $m'_2$, (5.3.1) suggests that we take as
estimates of $\theta_1$ and $\theta_2$

$$\begin{pmatrix} \hat\theta_1 \\ \hat\theta_2 \end{pmatrix}
= \begin{pmatrix}
\lambda_1 - \lambda_3 & \lambda_2 - \lambda_3 \\
\lambda_1^2 - \lambda_3^2 & \lambda_2^2 - \lambda_3^2
\end{pmatrix}^{-1}
\begin{pmatrix} m'_1 - \lambda_3 \\ m'_2 - 1 - \lambda_3^2 \end{pmatrix}
\qquad (5.3.2)$$

whenever $\hat\theta_1, \hat\theta_2 \ge 0$ and
$\hat\theta_1 + \hat\theta_2 \le 1$. When $(\hat\theta_1, \hat\theta_2)$
falls outside these boundaries a point within or on the boundary must be
chosen according to some additional criterion. One possibility is to
select the boundary point closest to $(\hat\theta_1, \hat\theta_2)$.
Since $m'_r \to E(X_G^r)$, (P), $r = 1, 2$, as $n \to \infty$, it follows
that such an EB rule is a.o. (P), and all $W(\cdot)$ being $\le 1$,
a.o. (E) is implied.

The normal distribution has monotone likelihood ratio, hence the
acceptance regions $A_1$, $A_2$, $A_3$ are the intersections of certain
intervals. Let $\xi_{rsG}$ be the solution of

$$f(x \mid \lambda_r)/f(x \mid \lambda_s) = \theta_s/\theta_r, \quad r < s;\; r = 1, 2;\; s = 2, 3. \qquad (5.3.3)$$

Then

$$\begin{aligned}
A_1 &= [-\infty, \xi_{12G}] \cap [-\infty, \xi_{13G}] \\
A_2 &= (\xi_{12G}, +\infty] \cap [-\infty, \xi_{23G}] \\
A_3 &= (\xi_{13G}, +\infty] \cap (\xi_{23G}, +\infty].
\end{aligned}$$

Note that it is possible for $A_2$ to be empty. When
$\xi_{23G} > \xi_{12G}$ we have

$$\begin{aligned}
W(\delta_G) = {} & \theta_1 \int_{\xi_{1.G}}^{+\infty} w(x - \lambda_1)\, dx
+ \theta_2 \int_{-\infty}^{\xi_{12G}} w(x - \lambda_2)\, dx \\
& + \theta_2 \int_{\xi_{23G}}^{+\infty} w(x - \lambda_2)\, dx
+ \theta_3 \int_{-\infty}^{\xi_{.3G}} w(x - \lambda_3)\, dx,
\end{aligned} \qquad (5.3.4)$$

where $\xi_{1.G} = \min(\xi_{12G}, \xi_{13G})$,
$\xi_{.3G} = \max(\xi_{13G}, \xi_{23G})$, and $w(\cdot)$ denotes the
standard normal density. Obvious modifications are required in (5.3.4)
when $\xi_{23G} < \xi_{12G}$.


The EB rule is obtained by replacing $\theta_r$, $\theta_s$ in (5.3.3)
by $\hat\theta_r$, $\hat\theta_s$, and $W(\hat\delta_G)$ is defined
similarly to $W(\delta_G)$. Using (5.3.4), it is easy to find
$W(\hat\delta_G)$ for any given $\hat\delta_G$. Hence, if the joint
distribution of $\hat\theta_1$ and $\hat\theta_2$ can be determined, it
is in principle also easy to calculate $E_n W(\hat\delta_G)$, for any
given set of $\theta$'s. In fact, such calculations are tedious, and
Table 5.2 gives results of a very limited study of the performance of
$\hat\delta_G$. The non-Bayes rule, $T$, to which the table refers, is
defined by $A_1 = [-\infty, (\lambda_1 + \lambda_2)/2]$,
$A_2 = ((\lambda_1 + \lambda_2)/2, (\lambda_2 + \lambda_3)/2]$, and
$A_3 = ((\lambda_2 + \lambda_3)/2, +\infty]$.

The results of Table 5.2 are substantially in agreement with those of 
Table 5.1. 

No new principle is involved when k > 3 and we shall, therefore, not
discuss this more general case in detail. As k increases, estimation of G
becomes more troublesome, and we refer to earlier comment on this
problem (Chapter 2). Robbins (1964) has dealt with the case of k
hypotheses, and Deely and Kruse (1968), amongst others, have
treated the same problem. The closely related compound decision
problem of choosing between k simple hypotheses has been studied
by Samuel (1965), but as far as the author is aware, no studies of the
behaviour of EB rules for k > 3 and finite n have been reported.

Table 5.2 Three simple hypotheses $H_1: \lambda = -1$,
$H_2: \lambda = 0$, $H_3: \lambda = +1$ with $N(\lambda, 1)$ kernel
distribution; $W(\cdot)$ is given for 10 EB rules each based on
$n = 100$, with $\theta_1 = 0.1$, $\theta_2 = 0.1$, $\theta_3 = 0.8$

W(Bayes)                 0.168
W(T)                     0.339
W(EB): sample   1        0.169
                2        0.199
                3        0.172
                4        0.312
                5        0.190
                6        0.399
                7        0.172
                8        0.168
                9        0.168
               10        0.179
Mean W(EB)               0.213, s.e. = 0.025
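
For the three-hypothesis normal example, the moment estimates obtained
by solving the two moment equations for $\theta_1$ and $\theta_2$, and
the expected losses of the Bayes rule and of the midpoint rule $T$, can
be sketched as follows. The expected loss is computed as one minus the
probability of a correct choice, by numerical integration on a grid
rather than from explicit cut-off points; the function names are ours,
and no projection onto the boundary is attempted when the moment
estimates fall outside the admissible region.

```python
import numpy as np

def moment_estimates(m1, m2, lams):
    """Solve the two moment equations for (theta_1, theta_2), N(lambda, 1) kernel."""
    l1, l2, l3 = lams
    a = np.array([[l1 - l3, l2 - l3],
                  [l1**2 - l3**2, l2**2 - l3**2]])
    b = np.array([m1 - l3, m2 - 1.0 - l3**2])
    return np.linalg.solve(a, b)

def expected_loss(thetas_rule, thetas_true, lams, grid=20001, half_width=10.0):
    """0-1 risk of the rule 'choose j maximizing thetas_rule[j] * f(x | lam_j)'."""
    lams = np.asarray(lams, dtype=float)
    thetas_rule = np.asarray(thetas_rule, dtype=float)
    thetas_true = np.asarray(thetas_true, dtype=float)
    x = np.linspace(lams.min() - half_width, lams.max() + half_width, grid)
    dens = np.exp(-0.5 * (x[:, None] - lams[None, :]) ** 2) / np.sqrt(2.0 * np.pi)
    choice = np.argmax(thetas_rule[None, :] * dens, axis=1)
    # Joint density of (true lambda = chosen lambda, x): the 'correct' mass.
    correct = thetas_true[choice] * dens[np.arange(grid), choice]
    dx = x[1] - x[0]
    prob_correct = dx * np.sum(0.5 * (correct[1:] + correct[:-1]))
    return 1.0 - prob_correct

lams = (-1.0, 0.0, 1.0)
thetas = (0.1, 0.1, 0.8)
w_bayes = expected_loss(thetas, thetas, lams)
w_t = expected_loss((1 / 3, 1 / 3, 1 / 3), thetas, lams)   # equal weights give rule T
print(w_bayes, w_t)
print(moment_estimates(0.7, 1.9, lams))   # exact population moments for these thetas
```

Passing estimated probabilities as `thetas_rule` and the true ones as
`thetas_true` gives the expected loss of a particular EB rule, which is
the quantity tabulated sample by sample in Table 5.2.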

5.3.2 Variable $m_i \ge 1$, $m \ge 1$

In the notation of section 5.2.2, the Bayes rule chooses
$\lambda_1, \lambda_2, \ldots, \lambda_k$ according as
$\theta_1 f(\mathbf{x} \mid \lambda_1)$,
$\theta_2 f(\mathbf{x} \mid \lambda_2), \ldots,
\theta_k f(\mathbf{x} \mid \lambda_k)$ is greatest. Estimation of
$\theta_1, \theta_2, \ldots, \theta_k$ in order to construct empirical
versions of the Bayes rule is somewhat more complicated here than in the
previous section, but follows the methods suggested in section 5.2.2;
see also section 3.8. Qualitatively, results similar to those
illustrated in Example 5.2.1 can be expected for the performance of EB
rules relative to Bayes rules. Details have not been worked out. Here,
as elsewhere, a more interesting question is the assessment of the
performance of the EB rule for a given set of previous data. A possible
approach to this problem is by using a bootstrap procedure.

Consider for the present discussion the case $k = 3$ as in
section 5.2.2. Suppose that estimates $\hat\theta_1$, $\hat\theta_2$ are
obtained from the given set of past observations. Now $W(\text{Bayes})$
is a function of $\theta_1$, $\theta_2$ and we shall write it as
$W_B(\theta_1, \theta_2)$. We can now estimate $W(\text{Bayes})$ by
$W_B(\hat\theta_1, \hat\theta_2)$. Then treating
$(\hat\theta_1, \hat\theta_2)$ as if it is the true
$(\theta_1, \theta_2)$, we can obtain an estimate of
$E_n W(\text{EB})$, which is also a function of $\theta_1$, $\theta_2$;
write it as $E_n W_{EB}(\theta_1, \theta_2)$. In theory we estimate it
by $E_n W_{EB}(\hat\theta_1, \hat\theta_2)$. The actual evaluation of
$E_n W_{EB}(\hat\theta_1, \hat\theta_2)$, or of
$E_n W_{EB}(\theta_1, \theta_2)$, may be feasible only by simulation.

The steps in this first stage, then, are:

1. Find an estimate $(\hat\theta_1, \hat\theta_2)$ of
$(\theta_1, \theta_2)$.

2. Generate sample sizes $m'_1, m'_2, \ldots, m'_n$ according to the
scheme that produced the observed $m_1, m_2, \ldots, m_n$. If the
mechanism is known it can be used, otherwise use the bootstrap approach
of sampling from the empirical distribution of $M$.

3. Generate a $\lambda_1, \lambda_2, \lambda_3$ sequence according to
the probabilities $\hat\theta_1$, $\hat\theta_2$,
$\hat\theta_3 = 1 - \hat\theta_1 - \hat\theta_2$.

4. Calculate estimates $(\hat\theta'_1, \hat\theta'_2)$ - actually they
are estimates of $(\hat\theta_1, \hat\theta_2)$ - and find the
corresponding $W(\text{EB})$, i.e.
$W_{EB}(\hat\theta'_1, \hat\theta'_2)$. Note that
$W_{EB}(\hat\theta'_1, \hat\theta'_2) = W_B(\hat\theta_1, \hat\theta_2)$
+ a positive quantity.

5. Repeat steps 2-4 a number of times and calculate the mean of the
$W_{EB}(\hat\theta'_1, \hat\theta'_2)$ values; it is an estimate of
$E_n W_{EB}(\hat\theta_1, \hat\theta_2)$.

The next stage in these calculations is to consider the variability of
$W_B(\hat\theta_1, \hat\theta_2)$ as an estimate of
$W_B(\theta_1, \theta_2)$. This is essentially also a bootstrap
operation; the sequence of $W_B(\hat\theta'_1, \hat\theta'_2)$ values
which can be obtained from repetitions of step 4 gives an estimate of
the sampling distribution of $W_B(\hat\theta_1, \hat\theta_2)$. However,
to obtain a sampling distribution of
$E_n W_{EB}(\hat\theta_1, \hat\theta_2)$ the entire sequence of steps
1-4 has to be repeated, a straightforward procedure in principle but
involving many steps.
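
The first stage of the bootstrap procedure can be sketched as below for
a normal kernel. Everything specific in the sketch is an assumption made
for illustration: component means distributed as
$N(\lambda, 1/m_i)$, three known values $-1, 0, +1$ of $\lambda$, moment
estimation of the prior probabilities with a crude truncation to the
admissible region, group sizes resampled from their empirical
distribution, and risks computed by numerical integration.

```python
import numpy as np

LAMS = np.array([-1.0, 0.0, 1.0])      # assumed known lambda_1, lambda_2, lambda_3

def estimate_thetas(xbar, m, lams=LAMS):
    """Moment estimates of (theta_1, theta_2, theta_3) from component means."""
    l1, l2, l3 = lams
    m1 = xbar.mean()
    m2 = (xbar**2 - 1.0 / m).mean()    # estimates sum_j theta_j * lambda_j^2
    a = [[l1 - l3, l2 - l3], [l1**2 - l3**2, l2**2 - l3**2]]
    t1, t2 = np.linalg.solve(a, [m1 - l3, m2 - l3**2])
    th = np.clip([t1, t2, 1.0 - t1 - t2], 0.0, None)   # crude truncation
    return th / th.sum()

def risk(th_rule, th_true, m_cur, lams=LAMS, grid=4001):
    """0-1 risk, for a current mean of m_cur observations, of the rule using th_rule."""
    sd = 1.0 / np.sqrt(m_cur)
    x = np.linspace(lams.min() - 8 * sd, lams.max() + 8 * sd, grid)
    dens = np.exp(-0.5 * ((x[:, None] - lams) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    choice = np.argmax(th_rule * dens, axis=1)
    correct = th_true[choice] * dens[np.arange(grid), choice]
    dx = x[1] - x[0]
    return 1.0 - dx * (correct.sum() - 0.5 * (correct[0] + correct[-1]))

def bootstrap_eb_risk(xbar, m, m_cur, reps=200, seed=0):
    rng = np.random.default_rng(seed)
    th_hat = estimate_thetas(xbar, m)                              # step 1
    w_b = risk(th_hat, th_hat, m_cur)                              # W_B at the estimate
    w_eb = []
    for _ in range(reps):
        m_star = rng.choice(m, size=len(m), replace=True)          # step 2
        lam_star = rng.choice(LAMS, size=len(m), p=th_hat)         # step 3
        x_star = lam_star + rng.normal(size=len(m)) / np.sqrt(m_star)
        th_star = estimate_thetas(x_star, m_star)                  # step 4
        w_eb.append(risk(th_star, th_hat, m_cur))
    w_eb = np.array(w_eb)
    return w_b, w_eb.mean(), w_eb                                  # step 5

# Simulated 'past data' for illustration only.
rng = np.random.default_rng(42)
n = 100
m = rng.integers(1, 10, size=n)
lam = rng.choice(LAMS, size=n, p=[0.1, 0.1, 0.8])
xbar = lam + rng.normal(size=n) / np.sqrt(m)
w_b, mean_w_eb, w_eb = bootstrap_eb_risk(xbar, m, m_cur=5, reps=50)
print(w_b, mean_w_eb)
```

Each bootstrap risk is at least as large as the estimated Bayes risk,
which reflects the remark in step 4 that the EB risk equals the Bayes
risk plus a positive quantity.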


5.4 Two composite hypotheses 

5.4.1 Introduction 

Any non-simple hypothesis will be referred to as a composite
hypothesis. In this section we deal with two composite hypotheses of
the form: $H_1: \lambda < \lambda_0$, $H_2: \lambda \ge \lambda_0$. In
specifying these hypotheses it is clearly implied that $G(\lambda)$ will
not be restricted to the class of finite step-functions with jumps at
given points $\lambda_i$. It may be a continuous distribution, or it may
be discrete with an infinite number of jumps. The possibility of it
being a finite step-function is not excluded, but the values of
$\lambda_1, \lambda_2, \ldots, \lambda_k$ will not usually be assumed
known.

Two composite hypotheses of the type $H_1: \lambda < \lambda_0$,
$H_2: \lambda \ge \lambda_1$, with $\lambda_1 > \lambda_0$, are often
considered in classical theory of testing hypotheses. The prior
distribution implied by these alternatives would have the appearance of
the two 'tails' of a distribution, the central portion having been
removed. Such a prior distribution is felt to be rather unrealistic in
the EB context, and will not be considered. In practice we may expect to
encounter problems in which the real interest centres on the question of
whether $\lambda$ is $< \lambda_0$ or $\ge \lambda_1$, but the
possibility of a $\lambda$ in the interval $[\lambda_0, \lambda_1)$ is
not excluded. A suitable formulation of the problem would be in terms of
three composite hypotheses, $H_1$, $H_2$ and
$H_3: \lambda_0 \le \lambda < \lambda_1$, $H_3$ representing an
'indifference' state.

Although we shall devote attention mainly to the 0-1 loss
function, the loss system described in section 1.5 is important in
having special significance for the exponential family of distributions.
The reason is that EB rules can be developed without explicit
estimation of G, under this system for such distributions. For this
reason it has played an important part in the development of the EB
approach to testing of hypotheses.

5.4.2 The loss system $|\lambda - \lambda_0|$

Details of the Bayes solution for this case are given in section 1.5.
Essentially the solution is contained in (1.5.2), and it states that
$H_1$ or $H_2$ should be chosen according as the posterior mean of
$\Lambda$ given $x$ is $<$ or $\ge \lambda_0$.

When we consider the exponential family of distributions, (1.5.2)
can be put in the form

$$C(x)p_G(x + 1) - \lambda_0 p_G(x) \lessgtr 0. \qquad (5.4.1)$$

In certain cases $\delta_G(x) = C(x)p_G(x + 1)/p_G(x)$ is monotonic in
$x$, and then the acceptance regions $A_1$ and $A_2$ for $H_1$ and $H_2$
are $A_1 = \{x: x \le x_0\}$ and $A_2 = \{x: x > x_0\}$, where $x_0$ is
such that $\delta_G(x_0) < \lambda_0$, $\delta_G(x_0 + 1) \ge \lambda_0$.

We consider the Poisson case in detail. First we observe that
$\delta_G(x)$ is non-decreasing for all $G(\lambda)$; this is discussed
in section 1.9. Therefore the Bayes decision rule can be formulated
simply in terms of the point of dichotomy, $x_0$.

Now, referring to our discussion of the Poisson case in section 1.9,
a natural formulation of an EB rule is in terms of the EB point of
dichotomy $\hat{x}_0$, given by

$$\delta_n(\hat{x}_0) < \lambda_0, \qquad \delta_n(\hat{x}_0 + 1) \ge \lambda_0, \qquad (5.4.2)$$

where $\delta_n(x)$ is the EB point estimator defined in section 1.9.
Since $\delta_n(x)$ is not, in general, a 'smooth' function of $x$, the
inequalities (5.4.2) can be satisfied by more than one value
$\hat{x}_0$. However, for any given set of past observations,
$W(\delta_n)$ can be found by substituting $\hat{A}_1$ and $\hat{A}_2$
for $A_1$ and $A_2$ in (1.5.1).

It is easy to see that since $\delta_n(x) \to \delta_G(x)$, (P), as
$n \to \infty$, $\hat{A}_1 \to A_1$, $\hat{A}_2 \to A_2$, and by
arguments similar to those of section 3.2, $\delta_n$ is a.o. As in most
of this work, exact results for finite $n$ can be easily obtained in
principle, but in reality only approximate results can be found with a
moderate amount of trouble.

Previous results for $\delta_n$ have shown that it is rather poor
compared with the 'best' non-Bayes procedure unless $n$ is quite large.
A similar tendency may be expected here but no numerical results are
available at present.


5.4.3 The 0-1 loss structure: parametric G families 

The Bayes solution for this case is essentially given by (1.5.1). Since
calculation of the posterior median of $\Lambda$ is required, so that
the result depends on

$$\int_{\lambda_0}^{\infty} f(x \mid \lambda)\, dG(\lambda),$$

it is generally not possible to find the solution in terms of the mixed
p.d.f., as in section 5.4.2. In order to obtain an EB rule, we shall have
to find an explicit estimate of G, or an estimate of an approximate G.
Thus we have to consider the use of such methods as were developed
for smooth EB point estimation. Of these, the case where G is known
to belong to a certain parametric family is the most straightforward.
Here, and elsewhere, the procedures for estimating G are the same as
those outlined for the problem of point estimation. Details follow for
two examples, the normal and binomial each with conjugate prior
$G(\lambda)$.

1. $f(x \mid \lambda) = N(\lambda, 1)$, $dG = N(\mu, \sigma^2)$. Since the posterior
distribution of $\Lambda$ given $x$ is normal with

$$E(\Lambda \mid x) = \frac{x + \mu/\sigma^2}{1 + 1/\sigma^2} = \mathrm{median}(\Lambda \mid x),$$

the Bayes rule may be stated as follows:

accept $H_1$ if $x \le \xi_G$,

accept $H_2$ if $x > \xi_G$,

where

$$\xi_G = \lambda_0(1 + 1/\sigma^2) - \mu/\sigma^2. \qquad (5.4.3)$$

See also Example 1.3.1.

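The Bayes risk, $W(\xi_G)$, is

$$W(\xi_G) = \int_{-\infty}^{\xi_G}\int_{\lambda_0}^{\infty} f(x \mid \lambda)\,dG(\lambda)\,dx
 + \int_{\xi_G}^{\infty}\int_{-\infty}^{\lambda_0} f(x \mid \lambda)\,dG(\lambda)\,dx, \qquad (5.4.4)$$

and since the joint distribution of $\Lambda$ and $X$ is normal, $W(\xi_G)$ can be
found from the tables of the bivariate normal distribution.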
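The double integral in (5.4.4) can also be evaluated by direct numerical integration. The following Python sketch is illustrative only: the function names and the Simpson-rule quadrature are assumptions of this example, not the method used for the tables in this chapter. It integrates over $\lambda$, using the fact that, for $f(x \mid \lambda) = N(\lambda, 1)$, the probability of $x \le \xi$ given $\lambda$ is $\Phi(\xi - \lambda)$.

```python
import math

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    if b <= a:
        return 0.0
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(a + i * h)
    return total * h / 3.0

def cutoff(lam0, mu, sigma2):
    """Bayes cut-off xi_G of (5.4.3)."""
    return lam0 * (1.0 + 1.0 / sigma2) - mu / sigma2

def risk(xi, lam0, mu, sigma2):
    """0-1 loss risk (5.4.4) of the rule 'accept H1 if and only if x <= xi'."""
    s = math.sqrt(sigma2)
    lo = min(mu - 10.0 * s, lam0)   # effective lower limit for lambda
    hi = max(mu + 10.0 * s, lam0)   # effective upper limit for lambda
    prior = lambda lam: norm_pdf((lam - mu) / s) / s
    # H1 accepted although lambda > lam0
    err1 = simpson(lambda lam: prior(lam) * norm_cdf(xi - lam), lam0, hi)
    # H2 accepted although lambda <= lam0
    err2 = simpson(lambda lam: prior(lam) * (1.0 - norm_cdf(xi - lam)), lo, lam0)
    return err1 + err2

# mu = 0 and lam0 = 0 give G(lam0) = 0.5 and cut-off 0
w = risk(cutoff(0.0, 0.0, 0.1), 0.0, 0.0, 0.1)
```

In this symmetric case the risk is the probability that $X$ and $\Lambda$ have opposite signs, $1/2 - \arcsin(\rho)/\pi$ with $\rho = \sigma/(1+\sigma^2)^{1/2}$, which is about 0.402 for $\sigma^2 = 0.1$.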

Estimation of $\mu$ and $\sigma^2$ has been discussed in section 3.7.3. The EB
rule is developed by replacing $G(\lambda)$ in the formulation of the Bayes
rule by the estimated, i.e. empirical, $G(\lambda)$. Consequently we obtain an
EB 'cut-off', $\hat\xi_G$, on replacing $\mu$ and $\sigma^2$ in (5.4.3) by $\hat\mu$ and
$\hat\sigma^2$. For the EB rule we have $W(\hat\xi_G)$, given by (5.4.4) with $\xi_G$
replaced by $\hat\xi_G$. Since $(\hat\mu, \hat\sigma^2) \to (\mu, \sigma^2)$, (P), this rule is
a.o. because all losses are bounded by 0 and 1. Some details of the
performance of this EB rule for small $n$ are given in section 5.4.5.

2. $p(x \mid \lambda) = \mathrm{Bin}(N, \lambda)$, $dG(\lambda) = \lambda^{p-1}(1-\lambda)^{q-1}\,d\lambda/B(p,q)$.
Estimation of $p$ and $q$ can be performed by the method explained in
section 3.9.2. The posterior p.d.f. of $\Lambda$ is

$$dG(\lambda \mid x) = \frac{\lambda^{p+x-1}(1-\lambda)^{N+q-x-1}}{B(p+x,\,N+q-x)}\,d\lambda,$$

and the posterior median $\lambda_{0.5 \mid x}$ is found from

$$\int_0^{\lambda_{0.5 \mid x}} dG(\lambda \mid x) = 0.5.$$

Then $H_1$ or $H_2$ is accepted according as $\lambda_{0.5 \mid x}$ is $< \lambda_0$ or
$\ge \lambda_0$. The EB rule is defined similarly, with the estimates
$(\hat p, \hat q)$ replacing $(p, q)$ in the formula for $G(\lambda \mid x)$. Again, the
EB rule is clearly a.o.; its performance for finite $n$ is examined
briefly in section 5.4.5.
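The beta-binomial rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, the beta distribution function is obtained by a simple midpoint rule (adequate when both posterior parameters are at least 1), and the median is located by bisection. For the EB rule one would pass estimates of $p$ and $q$ in place of the true values.

```python
import math

def beta_cdf(t, a, b, steps=4000):
    """Beta(a, b) distribution function by the midpoint rule (a, b >= 1 assumed)."""
    if t <= 0.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = t / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h  # midpoints lie strictly inside (0, 1)
        total += math.exp(log_norm + (a - 1.0) * math.log(u)
                          + (b - 1.0) * math.log(1.0 - u))
    return total * h

def posterior_median(x, n_trials, p, q):
    """Median of the Beta(p + x, n_trials + q - x) posterior, by bisection."""
    a, b = p + x, n_trials + q - x
    lo, hi = 0.0, 1.0
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a, b) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def decide(x, n_trials, p, q, lam0):
    """Accept H1 if the posterior median is below lam0, otherwise H2."""
    return "H1" if posterior_median(x, n_trials, p, q) < lam0 else "H2"
```

For example, with $p = q = 2$, $N = 4$ and $x = 2$ the posterior is the symmetric Beta(4, 4) distribution, whose median is 0.5.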

5.4.4 The 0-1 loss structure: approximating G 

The idea of approximating an unknown $G$ by a step-function $G_k$ has
been explored at some length in connection with the problem of
point estimation. Since we now want to estimate the median of the
posterior distribution, it seems more appropriate to approximate $G$
by a continuous $G_k^*$ of the form defined in section 2.4.3. It may be
observed that $G_k$ corresponds to approximation of $G$ by a discrete
distribution, this being analogous to the procedure used in calculating
means from grouped data. Approximation by $G_k^*$ is tantamount
to $dG$ being approximated by a histogram.

The arguments in section 2.7 can easily be re-framed to justify
approximation of $G$ by $G_k^*$ through the maximization of
$\int \log\{f_{G_k^*}(x)\}\,dF_G(x)$. In EB applications this implies estimation
of $\lambda$ by maximizing the approximate likelihood

$$L = \sum_{i=1}^{n} \log\{f_{G_k^*}(x_i)\} \quad \text{w.r.t. variation in } \lambda.$$

Since $L/n \to \int \log\{f_{G_k^*}(x)\}\,dF_G(x)$, (P), as $n \to \infty$, the EB rule
based on the estimated $G_k^*$ is a.o. as $k \to \infty$.

Successful application of this process in practice depends partly on
the goodness of the approximate Bayes rule for small values of k. 
This aspect of the approximate procedure is examined for normal 


Table 5.3 Normal-normal case, decision between two hypotheses
$H_1: \lambda \le \lambda_0$, $H_2: \lambda > \lambda_0$. Values of $W(\xi_G)$ and $W(\xi_k^*)$ are given
for various $\sigma^2$, $k$, $\lambda_0$ when $G$ is approximated by the continuous
distribution $G_k^*$: $dF(x \mid \lambda) = N(\lambda, 1)$, $dG(\lambda) = N(0, \sigma^2)$

            $\sigma^2 = 0.1$                          $\sigma^2 = 0.5$
$G(\lambda_0)$   $W(\xi_G)$   $W(\xi_k^*)$, two values of $k$   $W(\xi_G)$   $W(\xi_k^*)$, two values of $k$

0.5      0.402    0.402    0.402         0.304    0.304    0.304
0.6      0.372    0.375    0.379         0.289    0.292    0.301
0.7      0.296    0.301    0.297         0.248    0.256    0.255
0.8      0.199    0.200    0.199         0.183    0.190    0.183
0.9      0.099    0.100    0.100         0.097    0.097    0.099


Table 5.4 Decision between $H_1: \lambda \le \lambda_0$ and $H_2: \lambda > \lambda_0$ in the case
of a $N(\lambda, 1)$ kernel with $G(\lambda) = 1 - e^{-\lambda/\sigma}$, with $\sigma^2 = 0.1$.
Approximation of $G$ by $G_k^*$

$G(\lambda_0)$   W(Bayes)   W(approx. Bayes)   $W(T)$

0.5      0.42       0.44               0.42
0.7      0.27       0.28               0.40
0.9      0.09       0.09               0.33


$F(x \mid \lambda)$ in the examples which follow. The performance of smooth EB
rules, based on the $G_k^*$ approximation, is studied in section 5.4.5.

1. $f(x \mid \lambda) = N(\lambda, 1)$, $dG(\lambda) = N(\mu, \sigma^2)$. Table 5.3 gives details of
the approximation of $G$ by $G_k^*$, summarizing values of $W(\xi_G)$, $W(\xi_k^*)$,
for various $\sigma^2$, $k$, $\lambda_0$, where $\xi_k^*$ denotes the approximate Bayes rule
obtained on replacing $G$ by $G_k^*$. In terms of $W(\cdot)$ the approximation
is clearly excellent.

2. $G(\lambda) = 1 - e^{-\lambda/\sigma}$. There is no essential difference between this
example and (1) above. It only illustrates the performance of the
approximate Bayes procedure when $G$ is J-shaped, representing a
useful practical extreme. Details are given in Table 5.4. Again we see
that the approximate procedure is very good, as judged by $W(\cdot)$.

5.4.5 Performance of EB rules for finite n 

Although general results regarding asymptotic optimality can be 
formulated, it does not appear possible to obtain similarly general 
results for small n. Rather laborious calculations are required in 
studies of particular cases for small n. The examples which follow 
indicate that the EB rules can be satisfactory, and in some circum¬ 
stances substantially better than ‘conventional’ rules, for quite 
small n. 

1. $f(x \mid \lambda) = N(\lambda, 1)$, $dG(\lambda) = N(\mu, \sigma^2)$. When $G$ is known to be
normal, estimates of $\mu$ and $\sigma^2$ can be based on the mean and sample
variance of past observations, $\bar{x}$ and $s^2$. For a given $\bar{x}$ and $s^2$,
defining a certain $\hat\xi_G$, $W(\hat\xi_G)$ can be found from tables of the
bivariate normal distribution. Alternatively direct numerical
integration of the joint normal distribution can be used. Employing the
latter method, as well as numerical integration over the joint sampling
distribution of $\bar{x}$ and $s^2$, the values of $E_n W(\hat\xi_G)$ given in Table 5.5
were found.

Table 5.5 Values of $E_n\{W(\hat\xi_G)\}$, where $\hat\xi_G$ is an empirical Bayes
cut-off based on the knowledge that $G(\lambda)$ is a $N(\mu, \sigma^2)$ distribution,
$F(x \mid \lambda) = N(\lambda, 1)$

               $G(\lambda_0)$   $E_{10}\{W(\hat\xi_G)\}$   $E_{20}\{W(\hat\xi_G)\}$   $W(T)$    $W(\xi_G)$

$\sigma^2 = 0.1$   0.5      0.475            0.469            0.402   0.402
               0.6      0.459            0.449            0.399   0.372
               0.7      0.409            0.383            0.390   0.296
               0.8      0.313            0.281            0.372   0.199
               0.9      0.198            0.147            0.338   0.099

$\sigma^2 = 0.5$   0.5      0.406            0.373            0.304   0.304
               0.6      0.384            0.353            0.299   0.289
               0.7      0.314            0.285            0.284   0.248
               0.8      0.221            0.200            0.256   0.183
               0.9      0.110            0.104            0.205   0.097
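The quantity $E_n W(\hat\xi_G)$ can also be approximated by simulation. The Python sketch below is a hedged illustration, not the numerical integration described above: it assumes the moment estimates $\hat\mu = \bar{x}$ and $\hat\sigma^2 = \max(s^2 - 1, \text{floor})$ (the marginal variance of $X$ being $1 + \sigma^2$), and the small positive floor, the function names and the number of replications are all choices made for this example. It is not claimed to reproduce the tabulated values.

```python
import math
import random

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def simpson(f, a, b, n=400):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    if b <= a:
        return 0.0
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(a + i * h)
    return total * h / 3.0

def risk(xi, lam0, mu, sigma2):
    """Risk (5.4.4) of the cut-off xi, evaluated under the true N(mu, sigma2) prior."""
    s = math.sqrt(sigma2)
    lo, hi = min(mu - 10.0 * s, lam0), max(mu + 10.0 * s, lam0)
    prior = lambda lam: norm_pdf((lam - mu) / s) / s
    err1 = simpson(lambda lam: prior(lam) * norm_cdf(xi - lam), lam0, hi)
    err2 = simpson(lambda lam: prior(lam) * (1.0 - norm_cdf(xi - lam)), lo, lam0)
    return err1 + err2

def eb_cutoff(past, lam0, floor=1e-3):
    """EB cut-off: (5.4.3) with mu, sigma^2 replaced by assumed moment estimates."""
    n = len(past)
    xbar = sum(past) / n
    s2 = sum((x - xbar) ** 2 for x in past) / (n - 1)
    sigma2_hat = max(s2 - 1.0, floor)  # floor is an assumption of this sketch
    return lam0 * (1.0 + 1.0 / sigma2_hat) - xbar / sigma2_hat

def expected_eb_risk(n, lam0, mu, sigma2, reps=200, seed=1):
    """Monte Carlo estimate of E_n W(xi_hat_G) over samples of n past observations."""
    rng = random.Random(seed)
    sd_marginal = math.sqrt(1.0 + sigma2)
    total = 0.0
    for _ in range(reps):
        past = [rng.gauss(mu, sd_marginal) for _ in range(n)]
        total += risk(eb_cutoff(past, lam0), lam0, mu, sigma2)
    return total / reps
```

Because the Bayes cut-off minimizes (5.4.4), every simulated value of $W(\hat\xi_G)$, and hence their average, is at least $W(\xi_G)$.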


The distribution of the 'smooth' EB risk $W(\hat\xi_k^*)$ is most easily studied
by Monte Carlo methods, as in similar point estimation problems.
Table 5.6 contains estimates of $E_n W(\hat\xi_k^*)$ for $\sigma^2$, $k$, $\lambda_0$ as in Table 5.5.
In the preparation of this table, maximization of $L$ was carried out
subject to the constraint $\lambda_{j+1} - \lambda_j > \varepsilon$, $\varepsilon > 0$, to avoid
computational difficulties. Table 5.7 gives information on
$P[W(\hat\xi_k^*) > W(T)]$, in the form of frequencies of the event
$[W(\hat\xi_k^*) > W(T)]$ occurring in a certain number of Monte Carlo trials.

Certain expected trends are evident in these tables. For $G(\lambda_0)$ close
to 0.5, $E_n W(\hat\xi_k^*)$ is greater than $W(T)$, but it becomes substantially


Table 5.6 Values of $E_n W(\hat\xi_k^*)$, where $\hat\xi_k^*$ is the smooth EB cut-off in
the case $F(x \mid \lambda) = N(\lambda, 1)$, $G(\lambda) = N(0, \sigma^2)$

               $G(\lambda_0)$   $n = 10$          $n = 20$          $n = 50$

$\sigma^2 = 0.1$   0.5      0.478 ± 0.003   0.473 ± 0.003   0.475 ± 0.004
               0.6      0.475 ± 0.007   0.453 ± 0.006   0.455 ± 0.012
               0.7      0.440 ± 0.012   0.398 ± 0.011   0.368 ± 0.018
               0.8      0.381 ± 0.017   0.299 ± 0.013   0.259 ± 0.021
               0.9      0.272 ± 0.020   0.182 ± 0.014   0.130 ± 0.012

$\sigma^2 = 0.5$   0.5      0.404 ± 0.011   0.403 ± 0.010   0.359 ± 0.007
               0.6      0.406 ± 0.016   0.387 ± 0.010   0.340 ± 0.008
               0.7      0.376 ± 0.023   0.341 ± 0.016   0.284 ± 0.006
               0.8      0.267 ± 0.022   0.210 ± 0.006   0.205 ± 0.006
               0.9      0.132 ± 0.011   0.118 ± 0.008   0.112 ± 0.004


Table 5.7 Frequency distributions of $W(\hat\xi_k^*)$ in two classes:
(1) $W(\xi_G) \le W(\hat\xi_k^*) \le W(T)$, (2) $W(T) < W(\hat\xi_k^*) \le 1.0$

                        $n = 10$              $n = 20$              $n = 50$
               $G(\lambda_0)$   (1)   (2)   Total   (1)   (2)   Total   (1)   (2)   Total

$\sigma^2 = 0.1$   0.5        0   200    200      0   200    200      0    50     50
               0.6       31   169    200     47   153    200      5    45     50
               0.7      119    81    200    138    62    200     39    11     50
               0.8      130    70    200    163    37    200     44     6     50
               0.9      151    49    200    181    19    200     48     2     50

$\sigma^2 = 0.5$   0.5        0    50     50      0    50     50      0    50     50
               0.6       12    38     50      5    45     50     15    35     50
               0.7       18    32     50     15    35     50     32    18     50
               0.8       41     9     50     47     3     50     47     3     50
               0.9       45     5     50     48     2     50     49     1     50


smaller than $W(T)$ as $G(\lambda_0)$ increases. As before, the relative gain in
using a Bayes approach diminishes as $\sigma^2$ increases. Previous results,
showing that the smooth EB approach can be satisfactory for small
$n$, are confirmed.

2. $f(x \mid \lambda) = N(\lambda, 1)$, $G(\lambda) = 1 - e^{-\lambda/\sigma}$. When $G(\lambda)$ is known to
belong to the exponential family of distributions, the EB rule may be
based on the estimate

$$\hat\sigma = \max(0, \bar{x})$$

of $\sigma$, where $\bar{x}$ is the sample mean of the past observations. In this


Table 5.8 Binomial kernel distribution
$P(X = x \mid \lambda) = \binom{N}{x}\lambda^x(1-\lambda)^{N-x}$.
Beta prior distribution with parameters $p = 10$, $q = 9$,
testing $H_1: \lambda \le \lambda_0$ against $H_2: \lambda > \lambda_0$, $n = 50$, 'parametric
G' EB approach. Values of $E_{50}W(\cdot)$

$G(\lambda_0)$   $W(T)$    $E_{50}W(\hat\xi_G)$   $W(\xi_G)$

0.9      0.229   0.102 ± 0.001   0.100
0.7      0.306   0.253 ± 0.005   0.245
0.5      0.295   0.324 ± 0.014   0.295

case an explicit formula for $\xi_G$ cannot be found, and $\xi_G$ as well as
$W(\xi_G)$ must be found by numerical integration. The same applies to
$\xi_k^*$, $W(\xi_k^*)$ and $E_n W(\hat\xi_k^*)$. These results need little comment; they
indicate that a rather skew $G$ can also be approximated satisfactorily
by $G_k^*$, with small $k$.

3. $p(x \mid \lambda) = \mathrm{Bin}(N, \lambda)$, $dG(\lambda) = \lambda^{p-1}(1-\lambda)^{q-1}/B(p,q)\,d\lambda$. The EB
rule, $\hat\xi_G$, based on moment estimates $\hat p$, $\hat q$ of $p$, $q$, when $G$ is known to
be a member of the beta family of distributions, has been described in
section 5.4.3. Data on the performance of $\hat\xi_G$ in the case $p = 10$, $q = 9$
are given in Table 5.8. For comparison $W(T)$ is shown, where
$T = \lambda_0$, a non-Bayes rule. Previous experience suggests that the
performance of a smooth EB rule based on $G_k^*$ would be only slightly
worse than the performance of $\hat\xi_G$.



CHAPTER 6 


Bayes and empirical Bayes interval 
estimation 


6.1 Introduction 

In this chapter we consider interval estimation of a single parameter 
and region estimation of a vector of parameters. The Bayesian 
decision theoretic approach considers the choice of a particular 
interval or region for an unknown parameter as a decision-making 
problem. Optimal intervals are sought to minimize an expected loss. 
The Bayesian inferential approach does not make use of the notion of 
a loss function. Instead it considers the posterior distribution of an 
unknown parameter as an inferential statement and may set regions 
for the parameters under the derived posterior distribution. They may 
be quantitatively similar to regions obtained by a non-Bayes 
approach but their interpretation is clearly different. 

The empirical Bayes approach aims at obtaining estimates of the 
Bayes regions which converge to the true Bayes regions when the 
amount of previous data increases. With respect to previous data sets 
in the typical EB sampling scheme the EB regions are random. 
Therefore a further requirement of EB regions may be that the
coverage probability in the usual relative frequency sense is as good
as that of a classical, i.e. non-Bayes, interval. In particular, if a classical
confidence interval can be constructed with a given probability level
$\alpha$, a good EB interval should have at least the same level $\alpha$ if it is of the
same length. Alternatively it should have shorter length if it is of the
same level.


6.2 Intervals for single parameters 

Suppose that the single parameter $\lambda$ of the data d.f. $F(x \mid \lambda)$ of the
observable random variable $X$ is to be estimated by an interval $(a, b)$
in the parameter space. Let $\lambda$ be a realization of a r.v. $\Lambda$ with d.f. $G(\lambda)$.
Suppose that $x = (x_1, \ldots, x_m)$ is a set of independent observations on
$X$. The problem is to obtain $(a, b)$ based on data $x$ which is optimal in
some sense.

6.2.1 Optimal intervals 

Let $\mathcal{A}$ be the set of ordered pairs $(a, b)$ such that

$$\mathcal{A} = \{(a, b): -\infty < a,\ b < \infty,\ a < b\}.$$

If $G(\lambda)$ is known, the Bayesian decision theory seeks the ordered pair
$(a, b)$ from $\mathcal{A}$ which minimizes the expected value of a loss $L((a, b), \lambda)$
incurred by using $(a, b)$ when $\lambda$ is the true value of the unknown
parameter. A general class of linear loss functions is defined by

$$L((a,b),\lambda) = \begin{cases}
c_L(a - \lambda) + c_0(b - a) & \text{if } \lambda \le a \\
c_0(b - a) & \text{if } a < \lambda < b \\
c_U(\lambda - b) + c_0(b - a) & \text{if } b \le \lambda
\end{cases} \qquad (6.2.1)$$

where $c_0$, $c_L$, $c_U$ are known positive constants. For given data $x$, $a$ and
$b$ are functions of $x$. The overall expected loss of $(a, b)$ is

$$E_G E_D L(a, b, \Lambda) \qquad (6.2.2)$$

where $E_G$ is the expectation with respect to $G$ and $E_D$ is the
expectation with respect to the joint distribution

$$\prod_{i=1}^{m} F(x_i \mid \lambda)$$

of $X$. If $L(a, b, \lambda)$ is bounded, the minimization of the overall expected
loss defined in (6.2.2) is the same as minimization of the posterior
expected loss given by

$$E_B L(a, b, \Lambda) = \int L(a, b, \lambda)\,dB(\lambda \mid x, G) \qquad (6.2.3)$$

where $B(\lambda \mid x, G)$ is the posterior d.f. of $\Lambda$ given data $x$ and prior d.f. $G$.

Winkler (1972) takes (6.2.3) as the starting point of a decision
theoretic approach to interval estimation and considers the class of
loss functions

$$L(a, b, \lambda) = L_L(a - \lambda) + L_U(\lambda - b) + c_0(b - a) \qquad (6.2.4)$$

where $c_0$ is a known non-negative constant, $L_L(\cdot)$ and $L_U(\cdot)$ are
monotone non-decreasing functions with $L_L(x) = L_U(x) = 0$ for all
$x \le 0$. For the special case (6.2.4) the posterior expectation (6.2.3)
becomes

$$E_B L(a, b, \Lambda) = B(a \mid x, G)\,E_B\{L_L(a - \Lambda) \mid \Lambda \le a\}
 + \{1 - B(b \mid x, G)\}\,E_B\{L_U(\Lambda - b) \mid \Lambda \ge b\}
 + c_0(b - a).$$

The existence of a minimum of $E_B L(a, b, \Lambda)$ in this case is established
by Winkler (1972) for the special class of functions $L_L(\cdot)$ and $L_U(\cdot)$
that are convex. A method of construction of the optimal interval
$(a, b)$ is indicated for functions which are differentiable.

The loss function (6.2.1) is of special interest here. It is a special case
of (6.2.4) with convex functions and also provides a justification for
the use of quantiles of the posterior d.f. $B(\lambda \mid x, G)$ as interval estimates
as is the practice in a Bayesian inferential approach.

The optimal decision $(a, b)$ for the special loss function (6.2.1)
minimizes the quantity

$$E_B L(a, b, \Lambda) = c_L\int_{-\infty}^{a} (a - \lambda)\,dB(\lambda \mid x, G)
 + c_U\int_{b}^{\infty} (\lambda - b)\,dB(\lambda \mid x, G) + c_0(b - a)$$

$$\phantom{E_B L(a, b, \Lambda)} = c_L\int_{-\infty}^{a} B(\lambda \mid x, G)\,d\lambda
 + c_U\int_{b}^{\infty} \{1 - B(\lambda \mid x, G)\}\,d\lambda + c_0(b - a).$$

Thus $(a, b)$ is given by

$$B(a \mid x, G) = c_0/c_L, \qquad B(b \mid x, G) = 1 - c_0/c_U,$$

so that a solution with $a < b$ exists if $c_0/c_L + c_0/c_U < 1$. One may
choose $c_L = c_U = c$ and set $\alpha = 1 - 2c_0/c$. Then the condition for the
existence of a solution is $0 < \alpha < 1$ and the problem is to find $100\alpha\%$
limits for $\Lambda$ under the posterior d.f. $B(\lambda \mid x, G)$.
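The quantile conditions for the optimal interval are easy to check numerically when the posterior is normal. The Python sketch below is an illustration only: the normal posterior, the function names and the numerical values are assumptions of this example. It computes $(a, b)$ from $B(a) = c_0/c_L$, $B(b) = 1 - c_0/c_U$ and evaluates the posterior expected loss in closed form, using $E\max(a - \Lambda, 0) = (a - m)\Phi(u) + s\phi(u)$ with $u = (a - m)/s$ for $\Lambda \sim N(m, s^2)$.

```python
from statistics import NormalDist

def optimal_interval(post, c0, cL, cU):
    """Limits (a, b) with B(a) = c0/cL and B(b) = 1 - c0/cU for posterior d.f. B."""
    if c0 / cL + c0 / cU >= 1.0:
        raise ValueError("no solution with a < b: need c0/cL + c0/cU < 1")
    return post.inv_cdf(c0 / cL), post.inv_cdf(1.0 - c0 / cU)

def expected_loss(post, a, b, c0, cL, cU):
    """Posterior expected value of the loss (6.2.1) for a normal posterior."""
    std = NormalDist()
    m, s = post.mean, post.stdev
    u, v = (a - m) / s, (b - m) / s
    below = (a - m) * std.cdf(u) + s * std.pdf(u)          # E max(a - Lambda, 0)
    above = (m - b) * (1.0 - std.cdf(v)) + s * std.pdf(v)  # E max(Lambda - b, 0)
    return cL * below + cU * above + c0 * (b - a)

# Hypothetical posterior N(1, 2^2); c_L = c_U = 1 and c_0 = 0.025 give alpha = 0.95
posterior = NormalDist(1.0, 2.0)
a, b = optimal_interval(posterior, 0.025, 1.0, 1.0)
```

With these constants the interval is the equal-tailed 95% posterior interval, and moving either end point away from the computed value increases the expected loss.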

Hence under the special loss function (6.2.1) and certain restrictions
on the constants $c_0$, $c_L$, $c_U$, the optimal decision theory approach
gives a justification for the use of percentiles in the Bayesian
credibility interval approach.


6.2.2 Bayesian credibility intervals 


In the uniparameter case, the Bayesian inferential approach employs
intervals around some mode of the posterior d.f. of $\Lambda$. In the simplest
case, a one-sided $100\alpha\%$ limit is taken as the lower $100(1-\alpha)\%$ or
upper $100(1-\alpha)\%$ quantile of the d.f. $B(\lambda \mid x, G)$. To obtain two-sided
limits, equal tail area quantiles are often used. Thus to construct an
interval of the form $[\lambda_L^*, \infty)$ with level $\alpha$, one seeks $\lambda_L^*(x, G, \alpha)$, a
function of $x$, $G$ and $\alpha$, such that

$$B(\lambda_L^*(x, G, \alpha) \mid x, G) = 1 - \alpha.$$

Similarly, to construct an interval of the form $(-\infty, \lambda_U^*]$ with level $\alpha$,
one seeks $\lambda_U^*(x, G, \alpha)$ such that

$$B(\lambda_U^*(x, G, \alpha) \mid x, G) = \alpha.$$

Thus,

$$\lambda_U^*(x, G, 1 - \alpha) = \lambda_L^*(x, G, \alpha).$$

If two-sided limits for $\Lambda$ are required, we look for a pair
$[\lambda_U^*(x, G, (1-\alpha)/2),\ \lambda_U^*(x, G, (1+\alpha)/2)]$ where $1 - \alpha$ is equally
distributed in the two tails of the d.f. $B$. Hence it suffices to find the
lower limit $\lambda_L^*(x, G, \alpha)$ in terms of $x$, $G$ and $\alpha$.

An alternative to percentiles is to use a highest posterior density
(HPD) region (see e.g. Box and Tiao, 1973). This is given by the set of
values of $\lambda$ such that

$$b(\lambda \mid x, G) \ge c_\alpha, \qquad (6.2.5)$$

where $b(\cdot \mid x, G)$ is the p.d.f. of $B$ and $c_\alpha$ is a constant determined so that
the region has probability content $\alpha$ under the d.f. $B$. The region
defined by (6.2.5) also has a justification in terms of Bayesian decision
theory if the loss function is taken as

$$L((a, b), \lambda) = c_0(b - a) - J_{a,b}(\lambda)$$

where $J_{a,b}(\lambda) = 1$ if $a \le \lambda \le b$ and 0 otherwise (see Joshi, 1969).

As in the study of point estimators certain specializations of $G$ are
of interest. First we have the parametric $G$ case which is important
since a detailed treatment is possible, giving a better understanding of
the problems of interval estimation. The finite approximation to $G$ by
$G_k$ is again useful since this provides a practical and versatile family
applicable to many problems where exact specification of $G$ is difficult
and a completely unspecified $G$ is not attractive due to identifiability
requirements.

Suppose that $G$ belongs to the parametric family of $G(\lambda \mid \xi)$
where $\xi = (\xi_1, \ldots, \xi_q)$ is a vector of parameters. In such cases the
Bayesian limits are directly determined in terms of $\xi$ and notations
such as $\lambda_L^*(x, \xi, \alpha)$ will be used to emphasize this fact.

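Suppose next that $G$ is approximated by a step-function $G_k(\lambda)$ with
concentration of probability $\theta_j$ at the points $w_j$ $(j = 1, \ldots, k)$; we can
approximate the posterior d.f. of $\Lambda$ by

$$B_k(\lambda \mid x, \xi, m) = \sum_{j:\,w_j \le \lambda} \theta_j\Big\{\prod_{i=1}^{m} f(x_i \mid w_j)\Big\}\Big/ f_k(x; \xi, m)$$

where

$$f_k(x; \xi, m) = \sum_{j=1}^{k} \theta_j\Big\{\prod_{i=1}^{m} f(x_i \mid w_j)\Big\}$$

and $\xi$ stands for the set of parameters $(\theta, w)$. The Bayesian credibility
limits for $\Lambda$ can be obtained approximately as

1. $\lambda_L^*(x, \xi, \alpha) = \tfrac{1}{2}\{C_L(x, \xi, \alpha) + D_L(x, \xi, \alpha)\}$, where

$$C_L(x, \xi, \alpha) = \inf_{\lambda}[\lambda: B_k(\lambda \mid x, \xi, m) \ge 1 - \alpha],$$

$$D_L(x, \xi, \alpha) = \sup_{\lambda}[\lambda: B_k(\lambda \mid x, \xi, m) \le 1 - \alpha];$$

2. $\lambda_U^*(x, \xi, \alpha) = \tfrac{1}{2}\{C_U(x, \xi, \alpha) + D_U(x, \xi, \alpha)\}$, where

$$C_U(x, \xi, \alpha) = \inf_{\lambda}[\lambda: B_k(\lambda \mid x, \xi, m) \ge \alpha],$$

$$D_U(x, \xi, \alpha) = \sup_{\lambda}[\lambda: B_k(\lambda \mid x, \xi, m) \le \alpha];$$

and similarly for two-sided limits, using the levels $(1-\alpha)/2$ and
$(1+\alpha)/2$.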
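The inf/sup midpoint rule for a step-function posterior can be sketched as follows. This Python fragment is illustrative only: the $N(\lambda, 1)$ kernel, the function names and the numerical tolerance used to detect an exact tie are assumptions of this example. The atoms are taken to be sorted in increasing order.

```python
import math

def normal_kernel(x, lam):
    """Illustrative kernel f(x | lambda) = N(lambda, 1)."""
    return math.exp(-0.5 * (x - lam) ** 2) / math.sqrt(2.0 * math.pi)

def posterior_weights(xs, atoms, theta, kernel=normal_kernel):
    """Posterior probabilities of the atoms w_j given the observations xs."""
    raw = [t * math.prod(kernel(x, w) for x in xs) for w, t in zip(atoms, theta)]
    total = sum(raw)
    return [r / total for r in raw]

def midpoint_quantile(atoms, weights, level, tol=1e-12):
    """(C + D)/2 with C = inf{lam: B_k >= level}, D = sup{lam: B_k <= level}."""
    cum = 0.0
    for j, (w, p) in enumerate(zip(atoms, weights)):
        cum += p
        if abs(cum - level) <= tol:
            # B_k equals level on [w_j, w_{j+1}), so C = w_j and D = w_{j+1}
            nxt = atoms[j + 1] if j + 1 < len(atoms) else w
            return 0.5 * (w + nxt)
        if cum > level:
            # B_k jumps over level at w_j, so C = D = w_j
            return w
    return atoms[-1]

def lower_limit(atoms, weights, alpha):
    """Approximate lower limit: B_k at the limit is 1 - alpha."""
    return midpoint_quantile(atoms, weights, 1.0 - alpha)

def upper_limit(atoms, weights, alpha):
    """Approximate upper limit: B_k at the limit is alpha."""
    return midpoint_quantile(atoms, weights, alpha)
```

When the cumulative posterior probability passes through the required level inside a jump, both the infimum and the supremum coincide with that atom; the midpoint only differs from an atom when the level is attained exactly.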


6.2.3 Empirical Bayes confidence intervals 

When $G$ is not known, the Bayesian limits $\lambda_L^*(x, G, \alpha)$ or $\lambda_U^*(x, G, \alpha)$
cannot be found. Suppose however that when the current observation
vector $x$ is observed, corresponding to the current realization $\lambda$ of $\Lambda$,
there are available past observation vectors $x_1, \ldots, x_n$ corresponding
to the realizations $\lambda_1, \ldots, \lambda_n$ of $\Lambda$, where $x_i = (x_{i1}, \ldots, x_{im_i})^T$. Then the
unknown $G$ can be estimated from such data, and hence estimates of
the Bayesian limits can be obtained. Let $G_n$ be a consistent estimator
of $G$ at every point $\lambda$ in the parameter space $\Omega$. Then under fairly
general conditions it can be shown (Rutherford and Krutchkoff, 1969)
that

$$\lambda_L^*(x, G_n, \alpha) \to \lambda_L^*(x, G, \alpha).$$

Thus it is possible in principle to construct consistent estimates of the
Bayes limits.

However, the intervals or regions obtained by estimated quantities
such as $\lambda_L^*(x, G_n, \alpha)$ no longer possess a fixed probability content $\alpha$ as
was intended for the corresponding Bayesian intervals or regions. In
fact, these regions themselves are random regions and the probability
content of such a region is a random variable with respect to
variations in $G_n$.

One may impose the desired level of the probability content as a 
requirement of these regions. We may then say that these regions are 
EB regions. Since the probability content of a region based on the 
estimated Bayesian limits is a random variable, there are two ways of 
imposing a fixed level for it; one is an expected cover requirement and 
the other is a percentile cover requirement. These criteria are similar 
to those which are used in statistical tolerance region theory (see 
Guttman, 1970). 

Let $R(x, G, \alpha)$ be a general Bayesian credibility region for $\Lambda$ based on
data $x$, prior $G$ and credibility level $\alpha$. Let $R(x, G_n, \alpha)$ be an estimated
Bayesian region obtained by directly replacing $G$ by its estimate $G_n$
obtained from an EB scheme. This estimated region can be regarded
as a random region in the parameter space as far as the sampling
variation of $G_n$ is concerned and has to be assessed accordingly. In
particular, the coverage probability of $R(x, G_n, \alpha)$ under the posterior
d.f. is a random variable given by

$$C[R(x, G_n, \alpha)] = \int_{R(x, G_n, \alpha)} dB(\lambda \mid x, G, m).$$

Following the established pattern of treatment of non-Bayes statistical
tolerance regions, one can concentrate on two main characteristics
of $C[R(x, G_n, \alpha)]$. One is its expectation with respect to
variation of $G_n$. The other is a designated percentile of its distribution.
One can thus introduce the following two criteria:

Expected cover criterion: An estimated region $R(x, G_n, \alpha)$ is said to be
an EB region of expected cover $\beta$ if

$$E_n\{C[R(x, G_n, \alpha)]\} = \beta \qquad (6.2.6)$$

where $E_n\{\cdot\}$ is the expectation operator with respect to variation
of $G_n$.

Percentile cover criterion: An estimated region $R(x, G_n, \alpha)$ is said to
be an EB region of percentile cover $\beta$ with level $\gamma$ if

$$\Pr\nolimits_n\{C[R(x, G_n, \alpha)] \ge \beta\} = \gamma, \qquad (6.2.7)$$

where $\Pr_n\{\cdot\}$ is the probability statement with respect to variation
of $G_n$.

6.2.4 Construction of EB regions; parametric G priors 

We now consider the case where $G$ belongs to a specified parametric
family of distributions $G(\lambda \mid \xi)$ indexed by a set of unknown
parameters $\xi$. For detailed study we shall take the Bayesian region
$R(x, \xi, \alpha)$ to be the interval

$$R(x, \xi, \alpha) = \{\lambda: \lambda^*(x, \xi, \alpha) \le \lambda < \infty\},$$

defined by the lower limit, $\lambda^*(x, \xi, \alpha)$. The subscript $L$ is deleted for
notational convenience and $G$ is replaced by the $q$-vector
$\xi = (\xi_1, \ldots, \xi_q)$. The parameter $\xi$ is the only unknown element of $G$ and
can be estimated from an EB scheme as discussed in Chapter 2.

Let $\hat\xi_n$ be the ML estimator of $\xi$ based on the data of an EB scheme.
Then an estimated Bayesian region can be defined as

$$R(x, \hat\xi_n, \alpha) = \{\lambda: \lambda^*(x, \hat\xi_n, \alpha) \le \lambda < \infty\}.$$

The probability content of the estimated region $R(x, \hat\xi_n, \alpha)$ is no longer
$\alpha$ under the posterior d.f. $B(\lambda \mid x, \xi, m)$. We shall now demonstrate how
the criteria (6.2.6) and (6.2.7) can be employed to construct EB regions
and hence an EB lower limit. Although the discussion here
concentrates only on a lower limit, the case of an upper limit or of
two-sided limits can be treated in a similar manner.

(a) Expected cover EB limit

The coverage probability of the region $R(x, \hat\xi_n, \alpha)$ is

$$C[R(x, \hat\xi_n, \alpha)] = 1 - B[\lambda^*(x, \hat\xi_n, \alpha) \mid x, \xi, m].$$

The exact value of the expectation or the percentile of $C[R(x, \hat\xi_n, \alpha)]$
can be obtained in principle if the exact distribution of $\hat\xi_n$ is obtained.
However, this will not be the case in general. Thus an exact solution to
the problem of constructing EB limits, satisfying expected or
percentile cover criteria, is not possible except in some special cases.
Approximate solutions to the problem can be obtained by using
techniques from statistical tolerance region theory. The specific
methods employed here are implicit in the original works of Wilks
(1941) and Wald (1942); the work of the former author has recently
been extended to a general case by Atwood (1984).

Let $\psi(\xi)$ be a $q \times 1$ vector whose $i$th element is the bias of the $i$th
element, $\hat\xi_{ni}$, of $\hat\xi_n$. Also let $V(\xi)$ be a $q \times q$ matrix whose $(i, j)$th element
is the covariance of $\hat\xi_{ni}$ and $\hat\xi_{nj}$. Evaluation of $\psi(\xi)$ and $V(\xi)$ to terms
of order $n^{-1}$ can be performed as indicated in Chapter 2.

One can expand the coverage probability $C[R(x, \hat\xi_n, \alpha)]$ in a
Taylor series of $(\hat\xi_{ni} - \xi_i)$. Retaining terms of $O(n^{-1})$ in the expansion,
we get, after taking expectation with respect to variation in $\hat\xi_n$,

£„{C[R(x, !„«)]} 

= a+ t iA,(4)fr(I*(x, & a) | x, l m)- A * - ^ -- g 2 
i=i oCi 

+ it i r, 7 (!) r b(Z*(x, $, a)|x, I, 

i=i i=i L iOSj 

pfe(ylx,g,m) | dA*(x, g, a) dl*(x, g, a) 1 

1 <5y dtt dij J- 

Following the approach of Atwood (1984), we define the quantities:

$B_{10} = (\partial/\partial\lambda)\,B(\lambda \mid x, \xi, m)$,

$B_{20} = (\partial^2/\partial\lambda^2)\,B(\lambda \mid x, \xi, m)$,

$B_{01}$ as a $q \times 1$ vector whose $i$th element is $(\partial/\partial\xi_i)\,B(\lambda \mid x, \xi, m)$,

$B_{02}$ as a $q \times q$ matrix whose $(i, j)$th element is
$(\partial^2/\partial\xi_i\,\partial\xi_j)\,B(\lambda \mid x, \xi, m)$,

where all the derivatives are evaluated at $\lambda = \lambda^*(x, \xi, \alpha)$. We can now
rewrite the expected value of $C[R(x, \hat\xi_n, \alpha)]$ as
$E\{C[R(x, \hat\xi_n, \alpha)]\}$

= a - ^) T B 0 i + Bfo 1 {BSii tr {B 02 r(^)} 

$= \alpha + K(x, \xi, \alpha)$,

say, where $\Gamma(\xi) = V(\xi) + \psi(\xi)\psi(\xi)^T$.

To construct an EB lower limit with an expected cover $\beta$, we need to
seek $\alpha$ for which $E\{C[R(x, \hat\xi_n, \alpha)]\}$ is equal to $\beta$. The equation

$$E\{C[R(x, \hat\xi_n, \alpha)]\} = \beta$$

can be solved iteratively for $\alpha$. Let $\alpha^*(\beta)$ be the solution. Then the
required EB region is given by $R(x, \hat\xi_n, \alpha^*(\beta))$ and the corresponding
EB lower limit by $\lambda^*(x, \hat\xi_n, \alpha^*(\beta))$.

An approximation to $\alpha^*(\beta)$ can be obtained by steps similar to
those of Cox (1975), by using a first-order solution.

(b) Percentile cover EB region

To construct an EB region of a given percentile cover $\beta$ with level
$\gamma$, we obtain an approximation to the distribution of the quantity
$C[R(x, \hat\xi_n, \alpha)]$ induced by random variation in $\hat\xi_n$. For this purpose,
we need to obtain an approximation to the variance of $C[R(x, \hat\xi_n, \alpha)]$.
Again, using a Taylor expansion, we have, to the same order of
approximation as before,

var {C[R(x, £„, a)]} 

= t Z CO 
f=1 j=1 

= {Boi F(|)B 01 }/B, 0 . 

We may now adopt an approach similar to Wald (1942) and use a
normal approximation to the distribution of $C[R(x, \hat\xi_n, \alpha)]$. The
$\gamma$-percentile of this distribution is then approximated by

$$\alpha + K(x, \xi, \alpha) + z_\gamma\big[\{B_{01}^T V(\xi) B_{01}\}^{1/2}/B_{10}\big]$$

where $z_\gamma$ is the $\gamma$-probability point of a standard normal d.f. Thus an
approximate EB region whose coverage has a $\gamma$-percentile equal to $\beta$
is given by $R(x, \hat\xi_n, \alpha^*(\beta, \gamma))$ where

«*(/»,y) *p- K(x,LP) ~ ^[{BSi F(S)B 0 


dX*(x, a) <5/*(x, <=, a) 


H, 


Hj 


b(X*(x, a)|x, £,m) 
An alternative approximation can be obtained by following Wilks
(1941) and using a beta distribution approximation to the distribution
of $C[R(x, \hat\xi_n, \alpha)]$. Writing $E\{C[R(x, \hat\xi_n, \alpha)]\} = \mu_C$ and
$\mathrm{var}\{C[R(x, \hat\xi_n, \alpha)]\} = \sigma_C^2$, we can express the parameters of a beta
distribution by

$$p^* = \{\mu_C^2(1 - \mu_C) - \mu_C\sigma_C^2\}/\sigma_C^2,$$
$$q^* = \{\mu_C(1 - \mu_C)^2 - (1 - \mu_C)\sigma_C^2\}/\sigma_C^2. \qquad (6.2.8)$$

Let $y_\gamma(p^*, q^*)$ be the $(1 - \gamma)$ probability point of a beta $(p^*, q^*)$
distribution. Then

$$\Pr\nolimits_n\{C[R(x, \hat\xi_n, \alpha)] \ge y_\gamma(p^*, q^*)\} \simeq \gamma.$$

Thus for the region $R(x, \hat\xi_n, \alpha)$ to be an EB region of a given
percentile cover $\beta$, the value of $\alpha$ must satisfy the following relation:

$$y_\gamma(p^*, q^*) = \beta.$$

For given $\beta$ and $\gamma$, the required value for $\alpha$ can be obtained iteratively
in the above equation. Let $\alpha^*(\beta, \gamma)$ be the resulting solution; then the
region $R(x, \hat\xi_n, \alpha^*(\beta, \gamma))$ is an EB region satisfying the percentile cover
criterion and the corresponding EB limit is $\lambda^*(x, \hat\xi_n, \alpha^*(\beta, \gamma))$.
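The parameters in (6.2.8) are the usual method-of-moments fit of a beta distribution to a given mean and variance. The short Python sketch below, with hypothetical function names, computes $p^*$ and $q^*$ and converts them back to moments as a check; it does not attempt the iterative solution for $\alpha$, which additionally needs a beta quantile routine.

```python
def beta_from_moments(mean, var):
    """p*, q* of (6.2.8): the beta distribution with the given mean and variance."""
    if not (0.0 < mean < 1.0) or not (0.0 < var < mean * (1.0 - mean)):
        raise ValueError("need 0 < mean < 1 and 0 < var < mean * (1 - mean)")
    p_star = (mean ** 2 * (1.0 - mean) - mean * var) / var
    q_star = (mean * (1.0 - mean) ** 2 - (1.0 - mean) * var) / var
    return p_star, q_star

def beta_moments(p, q):
    """Mean and variance of a beta(p, q) distribution."""
    mean = p / (p + q)
    var = p * q / ((p + q) ** 2 * (p + q + 1.0))
    return mean, var

# A mean of 0.8 and variance of 0.01 correspond to p* = 12, q* = 3
p_star, q_star = beta_from_moments(0.8, 0.01)
```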


6.2.5 An example with normal F and normal G 

Consider the special case given in Example 1.3.1. We have $F(x \mid \lambda)$ as a
$N(\lambda, \sigma^2)$ distribution function with known $\sigma^2$, and $G(\lambda \mid \xi)$ as a
$N(\mu_G, \sigma_G^2)$ distribution function with known $\sigma_G^2$. The current data set
is $x = (x_1, \ldots, x_m)$, being $m$ observations on the r.v. $X$ with d.f.
$F(x \mid \lambda)$. The posterior d.f., $B(\lambda \mid x, \xi)$, of $\Lambda$ is $N(\mu^*, \sigma^{*2})$ with

$$\mu^* = (\sigma_G^2 + d^2)^{-1}(\sigma_G^2\bar{x} + d^2\mu_G),$$

$$\sigma^{*2} = (\sigma_G^2 + d^2)^{-1}\sigma_G^2 d^2,$$

where $\bar{x}$ is the sample mean and $d^2 = \sigma^2/m$. The Bayesian lower
credibility limit of level $\alpha$ is

$$\lambda^*(x, \mu_G, \alpha) = (\sigma_G^2 + d^2)^{-1}(\sigma_G^2\bar{x} + d^2\mu_G) + z_{1-\alpha}\,\sigma_G d\,(\sigma_G^2 + d^2)^{-1/2}$$

where $z_{1-\alpha}$ is the value such that $\Phi(z_{1-\alpha}) = 1 - \alpha$.

Suppose that $\mu_G$ is unknown and an EB scheme with unequal
component sample sizes is available as in section 3.8. Let $\bar{x}_1, \ldots, \bar{x}_n$ be
the sample means from the previous stages of the EB scheme. Then
the ML estimator of $\mu_G$ is given by

$$\hat\mu_G = \sum_{i=1}^{n} m_i\bar{x}_i \Big/ \sum_{i=1}^{n} m_i$$

with the sampling variance

$$V(\hat\mu_G) = \sum_{i=1}^{n} m_i(m_i\sigma_G^2 + \sigma^2) \Big/ \Big(\sum_{i=1}^{n} m_i\Big)^2.$$

The estimated Bayesian lower limit is $\lambda^*(x, \hat\mu_G, \alpha)$ and for the region
$R(x, \hat\mu_G, \alpha) = [\lambda^*(x, \hat\mu_G, \alpha), \infty)$ it induces a coverage probability under
the posterior d.f. of

$$C[R(x, \hat\mu_G, \alpha)] = 1 - \Phi\{(\sigma_G^2 + d^2)^{-1/2} d\,\sigma_G^{-1}(\hat\mu_G - \mu_G) + z_{1-\alpha}\}.$$

This problem is a very special case where exact solutions to both types
of EB limits can be obtained. We shall also derive approximate
solutions by employing the general procedure outlined in
section 6.2.4.

(a) Expected cover EB limit

First we note that $\hat\mu_G$ is distributed normally with mean $\mu_G$ and
variance $V(\hat\mu_G)$ given above. Next we note that the coverage
probability can be rewritten as

$$C[R(x, \hat\xi_n, \alpha)] = 1 - \Phi(\rho Z + z_{1-\alpha})$$

where

$$\rho = (\sigma_G^2 + d^2)^{-1/2} d\,\sigma_G^{-1}\Big\{\sum_{i=1}^{n} m_i(m_i\sigma_G^2 + \sigma^2)\Big\}^{1/2} \Big/ \sum_{i=1}^{n} m_i$$

and $Z$ is a standard normal $N(0, 1)$ r.v. Thus the exact value of the
expected coverage probability is

$$E_n C[R(x, \hat\xi_n, \alpha)] = 1 - \Phi\{z_{1-\alpha}(1 + \rho^2)^{-1/2}\}.$$

For the estimated Bayes lower limit $\lambda^*(x, \hat\xi_n, \alpha)$ to be an exact
expected cover EB lower limit with level $\beta$, we must have

$$z_{1-\alpha}(1 + \rho^2)^{-1/2} = z_{1-\beta}$$

or

$$\alpha = 1 - \Phi\{z_{1-\beta}(1 + \rho^2)^{1/2}\}.$$





Table 6.1 Value of $\alpha$ to give expected coverage $\beta$.

                     $\alpha$
          $n = 10$                $n = 30$
$\beta$      Approximate   Exact    Approximate   Exact

0.50   0.500         0.500    0.500         0.500
0.55   0.551         0.551    0.550         0.550
0.60   0.602         0.602    0.600         0.600
0.65   0.653         0.653    0.650         0.650
0.70   0.704         0.703    0.700         0.700
0.75   0.754         0.754    0.750         0.750
0.80   0.805         0.805    0.801         0.800
0.85   0.855         0.855    0.851         0.850
0.90   0.904         0.904    0.901         0.901
0.95   0.953         0.953    0.950         0.950


Thus the required EB lower limit is given by the expression for
$\lambda^*(x, \hat\xi_n, \alpha)$ with $z_{1-\alpha}$ replaced by $z_{1-\beta}(1 + \rho^2)^{1/2}$.

Next we apply the approximate technique of section 6.2.4. To terms
of $O(M^{-1})$, where $M = \sum_{i=1}^{n} m_i$, we have

$$E_n C[R(x, \hat\mu_G, \alpha)] = \alpha + \mathrm{var}(\hat\mu_G)\{(\sigma_G^2 + d^2)^{-1} d^2\sigma_G^{-2}\}\,z_{1-\alpha}\,\phi(z_{1-\alpha})/2$$

where $\phi(\cdot)$ is the p.d.f. of the $N(0, 1)$ distribution. Thus for the expected
coverage probability to be $\beta$, $\alpha$ must satisfy the equation

$$\alpha + \rho^2 z_{1-\alpha}\,\phi(z_{1-\alpha})/2 = \beta.$$

For values of $\beta = 0.50\,(0.05)\,0.95$, $n = 10$, $m_1 = m_2 = \cdots = m_n = 1$
and $\sigma^2 = \sigma_G^2 = 1$, corresponding values of $\alpha$ are computed from the
exact formula as well as the approximate formula. These are
given in Table 6.1.
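The exact and approximate expected-cover solutions can be compared directly. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, the approximate equation is solved by bisection (its left-hand side is increasing in $\alpha$ when $\rho^2 < 2$), and the value $\rho = 0.2$ used in the example is chosen for illustration rather than derived from a particular design.

```python
from statistics import NormalDist

STD = NormalDist()

def rho(sigma2, sigma2_g, m, ms):
    """rho for current sample size m and past sample sizes ms."""
    d2 = sigma2 / m
    v = sum(mi * (mi * sigma2_g + sigma2) for mi in ms) / sum(ms) ** 2  # V(mu_G hat)
    return (d2 * v / (sigma2_g * (sigma2_g + d2))) ** 0.5

def exact_alpha(beta, r):
    """alpha = 1 - Phi{z_(1-beta) (1 + rho^2)^(1/2)}."""
    return 1.0 - STD.cdf(STD.inv_cdf(1.0 - beta) * (1.0 + r * r) ** 0.5)

def approx_expected_cover(alpha, r):
    """alpha + rho^2 z_(1-alpha) phi(z_(1-alpha)) / 2."""
    z = STD.inv_cdf(1.0 - alpha)
    return alpha + r * r * z * STD.pdf(z) / 2.0

def approx_alpha(beta, r, iters=60):
    """Bisection solution of approx_expected_cover(alpha, r) = beta."""
    lo, hi = 1e-9, 1.0 - 1e-9
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if approx_expected_cover(mid, r) < beta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def expected_cover(alpha, r, steps=4000, span=8.0):
    """Trapezoidal evaluation of E{1 - Phi(rho Z + z_(1-alpha))}, Z ~ N(0, 1)."""
    c = STD.inv_cdf(1.0 - alpha)
    h = 2.0 * span / steps
    total = 0.0
    for i in range(steps + 1):
        z = -span + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * (1.0 - STD.cdf(r * z + c)) * STD.pdf(z)
    return total * h

a_exact = exact_alpha(0.8, 0.2)
a_approx = approx_alpha(0.8, 0.2)
```

Substituting the exact solution back into `expected_cover` recovers the target $\beta$, and for small $\rho$ the approximate and exact values of $\alpha$ agree closely.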

(b) Percentile cover EB limit

The event

$$C[R(x, \hat\mu_G, \alpha)] \ge \beta$$

is equivalent to the event

$$(\sigma_G^2 + d^2)^{-1/2} d\,\sigma_G^{-1}(\hat\mu_G - \mu_G) + z_{1-\alpha} \le z_{1-\beta},$$

i.e.

$$Z \le (z_{1-\beta} - z_{1-\alpha})/\rho,$$

where $Z$ is defined above, a $N(0, 1)$ r.v. Thus

$$\Pr\{C[R(x, \hat\mu_G, \alpha)] \ge \beta\} = \Phi\{(z_{1-\beta} - z_{1-\alpha})/\rho\}.$$

Hence for the estimated limit $\lambda^*(x, \hat\mu_G, \alpha)$ to be an exact EB limit with
a percentile cover at level $\gamma$, we need to have

$$(z_{1-\beta} - z_{1-\alpha})/\rho = z_\gamma,$$

i.e. $z_{1-\alpha}$ is replaced by $z_{1-\beta} - \rho z_\gamma$, or $\alpha$ by $1 - \Phi(z_{1-\beta} - \rho z_\gamma)$.

Next we apply the approximate technique of section 6.2.4. We can
obtain the variance of $C[R(x, \hat\mu_G, \alpha)]$ to terms of $O(M^{-1})$ as

$$\mathrm{var}\{C[R(x, \hat\mu_G, \alpha)]\} = \rho^2\phi^2(z_{1-\alpha}).$$


Table 6.2 Values of a to give percentile coverage specified by fl and y 


p 

y 

n= 10 

Approximate 

ot 

« = 30 

Exact Approximate 

Exact 

0-5 

0-5 

0-500 

0-500 

0-500 

0-500 

0-5 

0-6 

0-521 

0-520 

0-507 

0-507 

0-5 

0-7 

0-543 

0-542 

0-514 

0-514 

0-5 

0-8 

0-568 

0-567 

0-522 

0-522 

0-5 

0-9 

0-603 

0-601 

0-534 

0-534 

0-6 

0-5 

0-600 

0-600 

0-600 

0-600 

0-6 

0-6 

0-620 

0-619 

0-607 

0-606 

0-6 

0-7 

0-641 

0-640 

0-613 

0-613 

0-6 

0-8 

0-664 

0-663 

0-621 

0-621 

0-6 

0-9 

0-696 

0-695 

0-632 

0-632 

0-7 

0-5 

0-701 

0-700 

0-700 

0-700 

0-7 

0-6 

0-718 

0-717 

0-706 

0-706 

0-7 

0-7 

0-736 

0-735 

0-712 

0-712 

0-7 

0-8 

0-756 

0-756 

0-719 

0-719 

0-7 

0-9 

0-783 

0-782 

0-729 

0-729 

0-8 

0-5 

0-801 

0-800 

0-800 

0-800 

0-8 

0-6 

0-815 

0-814 

0-805 

0-805 

0-8 

0-7 

0-829 

0-828 

0-810 

0-810 

0-8 

0-8 

0-844 

0-844 

0-815 

0-815 

0-8 

0-9 

0-863 

0-864 

0-823 

0-823 

0-9 

0-5 

0-901 

0-900 

0-900 

0-900 

0-9 

0-6 

0-909 

0-909 

0-903 

0-903 

0-9 

0-7 

0-918 

0-917 

0-906 

0-906 

0-9 

0-8 

0-926 

0-926 

0-910 

0-910 

0-9 

0-9 

0-937 

0-938 

0-914 

0-914 
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We can now proceed to calculate the quantities p* and q* of (6.2.8) 
using 

/ic = a + P 2 Zi- < ,0(z 1 -«)/2 
<7 c 2 = p 2 ^ 2 (z I _J. 

Let y y (p*,q*) be the (1 — 7 ) probability point of a beta distribution 
with parameters p* and q*. Then a must satisfy the equation 

y y (p*,q*) = P- 

For values of)?, y = 05, (0-05), 095 and cr 2 = a^= l, = ■■■ = m„= 1 
the approximate value of a obtained from the above equation is 
compared with the corresponding exact value of a in Table 6.2. 
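
The exact percentile cover adjustment for the normal example, $\alpha = 1 - \Phi(z_{1-\beta} - \rho z_\gamma)$, is easy to evaluate directly. The sketch below does so and also checks the result against the probability statement $\Pr\{C \ge \beta\} = \Phi\{(z_{1-\beta} - z_{1-\alpha})/\rho\}$. The function names and the value `rho = 0.2` are assumptions for illustration only; $\rho$ has to be computed from its definition earlier in the chapter.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal N(0, 1)


def alpha_percentile_exact(beta: float, gamma: float, rho: float) -> float:
    """Adjusted level: alpha = 1 - Phi(z_{1-beta} - rho * z_gamma)."""
    z_beta = _STD.inv_cdf(1.0 - beta)
    z_gamma = _STD.inv_cdf(gamma)
    return 1.0 - _STD.cdf(z_beta - rho * z_gamma)


def prob_cover_at_least(beta: float, alpha: float, rho: float) -> float:
    """Pr{C >= beta} = Phi((z_{1-beta} - z_{1-alpha}) / rho) for a given alpha."""
    z_beta = _STD.inv_cdf(1.0 - beta)
    z_alpha = _STD.inv_cdf(1.0 - alpha)
    return _STD.cdf((z_beta - z_alpha) / rho)


if __name__ == "__main__":
    rho = 0.2  # hypothetical value used only for this demonstration
    a = alpha_percentile_exact(0.9, 0.9, rho)
    print(round(a, 3), round(prob_cover_at_least(0.9, a, rho), 3))
```

Setting $\gamma = 0.5$ gives $z_\gamma = 0$ and hence $\alpha = \beta$, which matches the first row of each $\beta$ block in the exact columns of Table 6.2.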


6.3 The multiparameter case: region estimators 

Suppose now that the data d.f. $F(x|\lambda)$ depends on a $p$-vector
parameter $\lambda = (\lambda_1, \ldots, \lambda_p)$. An estimate of $\lambda$ by a $p$-dimensional
region, $R$, a subset of the parameter space, is sought so that some
optimality criterion is achieved. Let $\lambda$ be a realization of a random
vector $\Lambda$ with d.f. $G(\lambda)$ and $x = (x_1, \ldots, x_n)$ be a set of independent
observations on r.v. $X$ with d.f. $F$.

6.3.1 Optimal region estimators

As in the single-parameter case, the Bayesian decision theoretic
approach is to construct an optimal region based on a chosen loss
function. A multiparameter analogue of the loss function (6.2.1) is
given by

$$L(R, \lambda) = c_0\,\mathrm{Vol}(R) - I(R)$$

where $\mathrm{Vol}(R)$ is the volume of a region $R$ and $I(\cdot)$ is the indicator
function which has value one if $\lambda \in R$ and value zero otherwise. Joshi
(1969) has shown that the optimal region which minimizes the
expected loss, $E_G E_D L(R, \lambda)$, is given by the HPD region

$$R^*(x, G, \alpha) = \{\lambda : b(\lambda | x, G) \ge K_\alpha\} \qquad (6.3.1)$$

where $K_\alpha$ is a constant determined so that the region has probability
content $\alpha$ under the posterior d.f. $B$.

The region defined in (6.3.1) is not necessarily a multidimensional
rectangle when $p > 1$. On the other hand, a Bayesian credibility
region can take any chosen shape as long as its coverage probability is
the specified value $\alpha$. A multidimensional rectangular region is
given by

$$R^+(x, G, \alpha) = \{\lambda : \lambda_{iL}^+(x, G, \alpha) < \lambda_i < \lambda_{iU}^+(x, G, \alpha);\ i = 1, 2, \ldots, p\}. \qquad (6.3.2)$$

This region is a multidimensional analogue of the two-sided limits
and the quantities $(\lambda_{iL}^+, \lambda_{iU}^+)$ are determined so that $R^+(x, G, \alpha)$ has
credibility level $\alpha$. One-sided analogues are obtained when the limits
for any $i$ are taken to be negative or positive infinity.
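
To make the rectangular construction concrete, here is a minimal sketch for a one-sided region of the form $\{\lambda : \lambda_i > l_i,\ i = 1, \ldots, p\}$. It assumes, purely for illustration, that the posterior components are independent normals, so that the joint content factorizes and each coordinate can be given marginal content $\alpha^{1/p}$. The text requires only that the joint content equal $\alpha$; the independence assumption, the equal split across coordinates, and the function name are all choices made here.

```python
from statistics import NormalDist


def one_sided_rectangle(post_means, post_sds, alpha):
    """Lower limits l_i with joint posterior content alpha.

    Assumes independent N(mean_i, sd_i^2) posterior components, so the joint
    content is the product of the marginal contents Pr(lambda_i > l_i).
    """
    p = len(post_means)
    per_coord = alpha ** (1.0 / p)  # marginal content needed in each coordinate
    return [
        NormalDist(m, s).inv_cdf(1.0 - per_coord)
        for m, s in zip(post_means, post_sds)
    ]


if __name__ == "__main__":
    print(one_sided_rectangle([0.0, 1.0, -2.0], [1.0, 0.5, 2.0], 0.9))
```

With dependent posterior components the limits would have to be found from the joint posterior, which is the situation treated in the sections that follow.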

6.3.2 Empirical Bayes confidence regions; general priors

We now consider EB confidence regions analogous to those developed
for the single-parameter case discussed earlier. The unknown $G$
can in principle be estimated nonparametrically by the ML technique
of Lindsay (1983a, b) as summarized in Chapter 2. Once a consistent
estimator $G_n$ of $G$ is available, consistent estimators of the limits $\lambda_{iL}^+$
and $\lambda_{iU}^+$ can be obtained. When $G$ is replaced by an estimate $G_n$, the
region $R(x, G_n, \alpha)$ induces a random coverage probability

$$C[R(x, G_n, \alpha)] = \int_{R(x, G_n, \alpha)} dB(\lambda | x, G, m)$$

which again needs to be adjusted in terms of $\alpha$ to achieve a desired
expected coverage or percentile coverage. The criteria described by
(6.2.6) and (6.2.7) are directly applicable in constructing EB confidence
regions. In the next section, we demonstrate the application of
these criteria to estimated Bayesian regions of the form (6.3.2) when $G$
belongs to a parametric family $G(\lambda|\xi)$.

When an EB scheme is available, estimation of the unknown
parameter $\xi$ follows the procedures given in section 2.9. It should also
be noted that although the discussion in the next section is for
univariate multiparameter d.f. $F(x|\lambda)$, the calculations are essentially
the same for the multivariate multiparameter case.

6.3.3 Construction of EB confidence regions: parametric priors 

We shall consider the special case of the Bayesian region given by
$R^+(x, \xi, \alpha)$ in (6.3.2) where the upper limits are taken to be infinity for
each $\lambda_i$. The estimated Bayesian region is thus

$$R^+(x, \hat\xi_n, \alpha) = \{\lambda : \lambda_i^+(x, \hat\xi_n, \alpha) < \lambda_i < \infty,\ i = 1, 2, \ldots, p\} \qquad (6.3.3)$$

where the subscript $L$ of the lower limits is omitted for convenience.
The case of upper limits can be handled in a similar manner.

The probability content of the region (6.3.3) under the posterior d.f.
of $\Lambda$ is

$$C[R^+(x, \hat\xi_n, \alpha)] = \int_{\lambda_p^+}^{\infty} \cdots \int_{\lambda_1^+}^{\infty} b(\lambda | x, \xi, m)\, d\lambda_1 \cdots d\lambda_p = I(\hat\xi_n, \xi, \alpha), \text{ say}, \qquad (6.3.4)$$

where $\lambda_i^+ = \lambda_i^+(x, \hat\xi_n, \alpha)$.



We now obtain multiparameter analogues of the results in 
section 6.2.1, using the same criteria to develop EB regions. 


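(a) Expected cover EB region

To apply the expected cover criterion to adjust the level $\alpha$ of
$R^+(x, \hat\xi_n, \alpha)$ we need to obtain the expected value of $C[R^+(x, \hat\xi_n, \alpha)]$,
defined in (6.3.4), as $\hat\xi_n$ varies. To terms of order $O(n^{-1})$, we have

$$E_n C[R^+(x, \hat\xi_n, \alpha)] = \alpha + \phi(\xi)^T\{(\partial/\partial\hat\xi_n) I(\hat\xi_n, \xi, \alpha)\}_{\hat\xi_n = \xi} + \tfrac{1}{2}\,\mathrm{tr}[\Gamma(\xi)\{(\partial^2/\partial\hat\xi_n\,\partial\hat\xi_n^T) I(\hat\xi_n, \xi, \alpha)\}_{\hat\xi_n = \xi}]$$

where $\phi(\xi)$ and $\Gamma(\xi)$ are, respectively, the bias vector and mean square
error matrix of $\hat\xi_n$, and further

1. $(\partial/\partial\hat\xi_n) I(\hat\xi_n, \xi, \alpha)$ is a $q \times 1$ vector whose $i$th element is
$(\partial/\partial\hat\xi_{ni}) I(\hat\xi_n, \xi, \alpha)$;

2. $(\partial^2/\partial\hat\xi_n\,\partial\hat\xi_n^T) I(\hat\xi_n, \xi, \alpha)$ is a $q \times q$ matrix whose $(i, j)$th element is
$(\partial^2/\partial\hat\xi_{ni}\,\partial\hat\xi_{nj}) I(\hat\xi_n, \xi, \alpha)$.

Let

$$I_s(\hat\xi_n, \xi, \alpha, \lambda_s) = \int_{\lambda_p^+}^{\infty} \cdots \int_{\lambda_{s+1}^+}^{\infty} \int_{\lambda_{s-1}^+}^{\infty} \cdots \int_{\lambda_1^+}^{\infty} b(\lambda | x, \xi, m)\, d\lambda_1 \cdots d\lambda_{s-1}\, d\lambda_{s+1} \cdots d\lambda_p$$

and also,

$$I_{st}(\hat\xi_n, \xi, \alpha, \lambda_s, \lambda_t) = \int \cdots \int_{\{\lambda_r > \lambda_r^+,\ r \ne s, t\}} b(\lambda | x, \xi, m) \prod_{r \ne s, t} d\lambda_r.$$

We then have

$$\{(\partial/\partial\hat\xi_n) I(\hat\xi_n, \xi, \alpha)\}_{\hat\xi_n = \xi} = -\sum_{s=1}^{p} I_s(\xi, \xi, \alpha, \lambda_s^+(x, \xi, \alpha))\,(\partial/\partial\xi)\lambda_s^+(x, \xi, \alpha).$$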

Also, 

{(^>^>(4^4.*)}^ 

- £ 4 ( 4 . 4 ,«, x; (x, £, a), x, + (x, &«))} 

r*s VQj J 

+ /,(!, §, a, X s + (x, §, a ))(d 2 /d£, 34,)X S + (x, £, a) . 

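We need to evaluate first- and second-order derivatives of $\lambda_s^+(x, \xi, \alpha)$.
For this purpose, we introduce the following notation:

$$B_s(\xi, \xi, \alpha, y) = \int_{\lambda_p^+}^{\infty} \cdots \int_{\lambda_{s+1}^+}^{\infty} \int_{y}^{\infty} \int_{\lambda_{s-1}^+}^{\infty} \cdots \int_{\lambda_1^+}^{\infty} b(\lambda | x, \xi, m)\, d\lambda_1 \cdots d\lambda_p,$$

$$B_{s10} = (\partial/\partial y) B_s(\xi, \xi, \alpha, y),$$

$$B_{s20} = (\partial^2/\partial y^2) B_s(\xi, \xi, \alpha, y).$$

Also let $B_{s01}$ be a $q \times 1$ vector whose $i$th element is

$$(\partial/\partial\xi_i) B_s(\xi, \xi, \alpha, y),$$

and $B_{s11}$ be a $q \times 1$ vector whose $i$th element is

$$(\partial^2/\partial\xi_i\,\partial y) B_s(\xi, \xi, \alpha, y),$$

and $B_{s02}$ be a $q \times q$ matrix whose $(i, j)$th element is

$$(\partial^2/\partial\xi_i\,\partial\xi_j) B_s(\xi, \xi, \alpha, y).$$

All the derivatives $B_{s10}$, $B_{s20}$, $B_{s01}$, $B_{s11}$, $B_{s02}$ are evaluated at
$y = \lambda_s^+(x, \xi, \alpha)$.

Next, differentiating both sides of the equation

$$I(\xi, \xi, \alpha) = \alpha \qquad (6.3.5)$$

with respect to $\xi$ we get a set of equations which can be rewritten in
matrix form as

$$(\partial/\partial\xi)\lambda_s^+(x, \xi, \alpha) = (B_{s10})^{-1} B_{s01}.$$

Differentiating both sides of (6.3.5) twice, with respect to $\xi_i$ first and
then $\xi_j$, and rearranging the resulting equations in matrix form, we get

$$B_{s10}\{(\partial^2/\partial\xi\,\partial\xi^T)\lambda_s^+(x, \xi, \alpha)\} = B_{s02} - (B_{s11}B_{s01}^T/B_{s10} + B_{s01}B_{s11}^T/B_{s10}) - (B_{s10})^{-2}B_{s20}B_{s01}B_{s01}^T.$$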


Thus we have 

£„{c[/r(x, 

5=1 

+ i£ t U4,l,aJ s + ,X + )(B J10 )- 2 Bj 0i r(^)B (01 

S = 1 

Z trr(^)S j02 

5 = 1 

+i£ Aior'BJoin^n 

5=1 

$= \alpha + \phi(\xi, \alpha, x)$, say. (6.3.6)

To construct an expected cover region of size $\beta$, we need to find the
value of $\alpha$ satisfying

$$\alpha + \phi(\xi, \alpha, x) = \beta.$$

This equation can be solved iteratively. An approximate solution is
$\alpha \simeq \beta - \phi(\hat\xi_n, \beta, x)$.
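
One simple way to carry out the iteration is the fixed-point scheme $\alpha_{k+1} = \beta - \phi(\alpha_k)$ started at $\alpha_0 = \beta$, whose first step is exactly the approximate solution quoted above. The sketch below is one possible scheme, not necessarily the one the authors had in mind. The correction term is passed in as a function of $\alpha$ alone (with $\xi$ and $x$ held fixed); for the demonstration it is taken, as an assumption, to be the single-parameter normal correction $\rho^2 z_{1-\alpha}\phi(z_{1-\alpha})/2$ with a hypothetical $\rho^2$.

```python
from statistics import NormalDist
from typing import Callable

_STD = NormalDist()  # standard normal N(0, 1)


def solve_level(beta: float,
                correction: Callable[[float], float],
                tol: float = 1e-12,
                max_iter: int = 200) -> float:
    """Fixed-point iteration for alpha + correction(alpha) = beta.

    Starts from alpha_0 = beta; convergence relies on the correction term
    varying slowly with alpha, as it does when it is of order 1/n.
    """
    alpha = beta
    for _ in range(max_iter):
        new_alpha = beta - correction(alpha)
        if abs(new_alpha - alpha) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha


def normal_correction(rho2: float) -> Callable[[float], float]:
    """Illustrative correction term rho^2 * z_{1-alpha} * phi(z_{1-alpha}) / 2."""
    def phi_term(alpha: float) -> float:
        z = _STD.inv_cdf(1.0 - alpha)
        return rho2 * z * _STD.pdf(z) / 2.0
    return phi_term


if __name__ == "__main__":
    corr = normal_correction(0.04)   # hypothetical rho^2
    beta = 0.7
    one_step = beta - corr(beta)     # the approximate solution
    print(one_step, solve_level(beta, corr))
```

When the correction is small, the one-step value and the converged value differ only in the higher-order digits, which is why the one-step approximation is often adequate.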

(b) Percentile cover EB region

The above method of obtaining an approximate expression for the
expected value of the posterior coverage probability $C[R^+(x, \hat\xi_n, \alpha)]$
can be extended to obtain a similar approximation to its variance as
$\hat\xi_n$ varies. To $O(n^{-1})$, we have

$$\mathrm{var}\{C[R^+(x, \hat\xi_n, \alpha)]\} = \sum_{i=1}^{q}\sum_{j=1}^{q} \mathrm{cov}(\hat\xi_{ni}, \hat\xi_{nj})\,\{(\partial/\partial\hat\xi_{ni}) I(\hat\xi_n, \xi, \alpha)\}_{\hat\xi_n = \xi}\,\{(\partial/\partial\hat\xi_{nj}) I(\hat\xi_n, \xi, \alpha)\}_{\hat\xi_n = \xi}$$

$$= \sum_{s=1}^{p} B_{s01}^T V(\xi) B_{s01},$$

where $V(\xi)$ is the covariance matrix of $\hat\xi_n$.

To construct a percentile cover EB region, we need to use an
approximation to the distribution of $C[R^+(x, \hat\xi_n, \alpha)]$. In particular a
beta distribution approximation can be made along the lines of
section 6.2.1, using equation (6.2.8). While a solution following such
steps is straightforward in principle, the actual implementation is
numerically complicated.


6.4 Bayes statistical tolerance regions 

In this section and in section 6.5 we discuss statistical prediction from 
the Bayes and EB points of view. The emphasis is on setting a 
tolerance limit or region for a single future observation or a set of 
future observations, instead of on point prediction. Thus the problem 
considered here could be regarded as a branch of the general problem 
of confidence region estimation. Indeed the techniques developed in 
sections 6.1-6.3 for Bayes and EB confidence region estimation can 
be readily adapted to obtain analogous results for the present 
problem. 

In non-Bayes tolerance region theory, the established convention is 
to distinguish between two types of tolerance regions, namely, expected 
cover and percentile cover regions; see Guttman (1970). A similar 
distinction is made in Bayesian tolerance region theory. In the 
following, we summarize Bayes prediction theory, taking as our 
starting point the assumption of a particular prior distribution of the 
parameters. This will pave the way for the development of EB analogues
in section 6.5.

Let $X$ be a continuous r.v. with d.f. $F(x|\lambda)$, and a p.d.f. $f(x|\lambda)$
depending on the unknown parameter vector $\lambda$. Suppose also that the
range, $\mathcal{X}$, of $X$ is independent of $\lambda$. Let $X = (X_1, \ldots, X_m)$ be a set of $m$
independent copies of $X$ whose realization is $x = (x_1, \ldots, x_m)$. Let $\lambda$ be
a realization of a vector r.v. $\Lambda$ with d.f. $G(\lambda)$, concentrated on a
parameter space $\mathcal{L}$. Then the posterior d.f. of $\Lambda$ given $X = x$ is

$$dB(\lambda | x; G, m) = \{h(x; G, m)\}^{-1} \prod_{i=1}^{m} f(x_i|\lambda)\, dG(\lambda) \qquad (6.4.1)$$

where

$$h(x; G, m) = \int \prod_{i=1}^{m} f(x_i|\lambda)\, dG(\lambda).$$

When a set of sufficient statistics $T(X)$ with p.d.f. $p_m(t|\lambda)$ exists the
expression (6.4.1) reduces to

$$dB(\lambda | t; G, m) = \{p_{G,m}(t)\}^{-1} p_m(t|\lambda)\, dG(\lambda)$$

where

$$p_{G,m}(t) = \int p_m(t|\lambda)\, dG(\lambda).$$

Let $\mathcal{X}^{(m)}$ be the space of observations $x$ and let $\mathcal{S}$ be the event space
of $\mathcal{X}$. Let $A^*(\cdot)$ be a statistic with domain $\mathcal{X}^{(m)}$ and range $\mathcal{S}$ such that
for each $x$, it provides a subset $A^*(x)$ of $\mathcal{X}$. Thus under $F$, $A^*(x)$
induces the probability content

$$C[A^*(x), \lambda] = \int_{A^*(x)} dF(y|\lambda). \qquad (6.4.2)$$


The problem is to obtain $A^*(x)$ of ‘a given shape’ such that either of
the following criteria is satisfied:

Expected cover criterion: If the coverage probability defined in (6.4.2)
is such that

$$E_B\{C[A^*(x), \lambda]\} = \beta \qquad (6.4.3)$$

then $A^*(x)$ is said to have an expected cover of size $\beta$.

Percentile cover criterion: If the coverage probability defined in
(6.4.2) is such that

$$\Pr\nolimits_B\{C[A^*(x), \lambda] \ge \beta\} = \gamma \qquad (6.4.4)$$

then $A^*(x)$ is said to have a percentile cover of size $\beta$ with level $\gamma$.

By the phrase ‘a given shape’, we mean, for example, that $A^*(x)$ may
be an interval $[u_1(x), u_2(x)]$ where one of the $u_i$'s can be infinite. For
simplicity we consider only those regions $A^*(x)$ of the form $[u(x), \infty)$
to obtain a lower limit for a future observation on $X$. An upper
tolerance limit or a pair of two-sided limits may be obtained in a
similar way. The problem of determining $A^*(x)$ of the form $[u(x), \infty)$
reduces to that of determining the quantity $u^*(x, \beta, G)$ or $u^*(x, \beta, \gamma, G)$,
say, which is the solution for $u(x)$ in (6.4.3) or (6.4.4) when $A^*(x)$ is
expressed in terms of $u(x)$.


6.4.1 Expected cover regions 

The problem of determining $u^*(x, \beta, G)$ which satisfies the expected
cover criterion is readily solved since, by Fubini's theorem, we can
write

$$E_B C[A^*(x), \lambda] = \int_{A^*(x)} E_B f(y|\lambda)\, dy,$$

i.e. $u^*(x, \beta, G)$ is the solution of the equation

$$\int_{u^*(x, \beta, G)}^{\infty} p(y | x, G, m)\, dy = \beta,$$

where

$$p(y | x, G, m) = \int f(y|\lambda)\, dB(\lambda | x, G, m) \qquad (6.4.5)$$

is the predictive p.d.f. of a r.v. $Y$ representing a future observation $y$ on
$X$. To obtain (6.4.5), knowledge of the functional form of $h(x; G, m)$ in
(6.4.1) is enough, for we have

$$p(y | x, G, m) = \{h(x; G, m)\}^{-1} \int f(y|\lambda) \prod_{i=1}^{m} f(x_i|\lambda)\, dG(\lambda)$$

$$= \{h(x; G, m)\}^{-1}\, h((y, x); G, m + 1) \qquad (6.4.6)$$

where $(y, x)$ is treated as a ‘pooled sample’ of the current observation
$x$ and the future observation $y$, and $h(\cdot; G, m + 1)$ is evaluated in exactly
the same way as $h(\cdot; G, m)$ with $m$ replaced by $m + 1$.

Further reduction of (6.4.6) into simpler forms can be achieved
when there exists a set of sufficient statistics for $\lambda$. Let $T(X)$ and
$T(Y, X)$ be the sufficient statistics for $\lambda$ based on $X$ and the pooled
sample $(Y, X)$ respectively. First we note that (6.4.6) reduces to

$$p(y | x, G, m) = p(y | T(x), G) = \{p_{G,m}(t)\}^{-1} \int f(y|\lambda)\, p_m(t|\lambda)\, dG(\lambda)$$

where $t$ is a realization of $T(X)$ when $X = x$. Next, we use the identity

$$f(y|\lambda)\, p_m(t|\lambda) = r(y | T(y, x))\, p_{m+1}(T(y, x)|\lambda),$$

where $r(y | T(y, x))$ is the conditional p.d.f. of $Y$ given $T(Y, X) = T(y, x)$
and is independent of $\lambda$. Hence we get

$$p(y | x, G, m) = \{p_{G,m}(t)\}^{-1}\, r(y | T(y, x))\, p_{G,m+1}(T(y, x)).$$

Thus to evaluate the predictive density in this case, it is required to
compute only the unconditional p.d.f. of $T(X)$ and the conditional
p.d.f. of $Y$ given $T(Y, X)$. Evaluation of these quantities is straightforward
once $G$ is known.
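
The relation (6.4.6) can be checked numerically in a simple conjugate setting. The sketch below assumes, for illustration only, normal data with known standard deviation `SIGMA` and a normal prior `PRIOR`; neither choice comes from the text. It evaluates $h$ by the trapezoidal rule, forms the ratio $h((y, x); G, m+1)/h(x; G, m)$, and compares it with the closed-form normal predictive density. The expected cover lower limit is then the $(1 - \beta)$ point of the predictive distribution.

```python
from math import prod, sqrt
from statistics import NormalDist

SIGMA = 1.0                    # known data s.d. (assumed for this example)
PRIOR = NormalDist(0.0, 1.0)   # prior G for lambda (assumed for this example)


def h(xs, n_grid: int = 4001, width: float = 10.0) -> float:
    """h(x; G, m): integral of prod_i f(x_i | lam) dG(lam), trapezoidal rule."""
    lo = PRIOR.mean - width * PRIOR.stdev
    step = 2.0 * width * PRIOR.stdev / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        lam = lo + k * step
        val = prod(NormalDist(lam, SIGMA).pdf(x) for x in xs) * PRIOR.pdf(lam)
        total += val * (0.5 if k in (0, n_grid - 1) else 1.0)
    return total * step


def predictive_pdf_ratio(y: float, xs) -> float:
    """Predictive density via (6.4.6): h((y, x); m + 1) / h(x; m)."""
    return h([y, *xs]) / h(xs)


def predictive_closed_form(xs) -> NormalDist:
    """Closed-form predictive distribution for the normal-normal model."""
    m = len(xs)
    post_var = 1.0 / (m / SIGMA**2 + 1.0 / PRIOR.variance)
    post_mean = post_var * (sum(xs) / SIGMA**2 + PRIOR.mean / PRIOR.variance)
    return NormalDist(post_mean, sqrt(SIGMA**2 + post_var))


def expected_cover_lower_limit(xs, beta: float) -> float:
    """u* such that the predictive probability of [u*, inf) equals beta."""
    return predictive_closed_form(xs).inv_cdf(1.0 - beta)


if __name__ == "__main__":
    data = [0.3, 1.1, 0.8]
    print(predictive_pdf_ratio(0.5, data), predictive_closed_form(data).pdf(0.5))
    print(expected_cover_lower_limit(data, 0.9))
```

The numerical integration is used here only to mirror the form of (6.4.6); in a conjugate model such as this one the closed form would be used in practice.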

6.4.2 Percentile cover region 

Next consider the problem of determining $u^*(x, \beta, \gamma, G)$ satisfying the
criterion (6.4.4) which requires a given size $\beta$ for the $100(1 - \gamma)\%$
percentile of the coverage $C[A^*(x), \lambda]$ under the posterior d.f.
$B(\lambda | x, G, m)$. Let the set of values of $\lambda$ for which the coverage
probability defined in (6.4.2) is at least $\beta$ be

$$A[\lambda, u(x), \beta] = \{\lambda : C[A^*(x), \lambda] \ge \beta\}.$$

Then $u^*(x, \beta, \gamma, G)$, satisfying (6.4.4), is the solution of the equation

$$\int_{A[\lambda, u^*(x, \beta, \gamma, G), \beta]} dB(\lambda | x, G, m) = \gamma. \qquad (6.4.7)$$

Further simplification of (6.4.7) could be achieved if we confine $f(\cdot|\lambda)$
to more special forms like location-scale types as illustrated in the
following section.

6.4.3 Location and scale family d.f. $F(\cdot|\lambda)$

Let the d.f. $F(\cdot|\lambda)$ be of the form $F\{(x - \lambda_1)/\lambda_2\}$ where $F(\cdot)$ is of
known form and $\lambda = (\lambda_1, \lambda_2)$. Then

$$C[A^*(x), \lambda] = 1 - F\{(u(x) - \lambda_1)/\lambda_2\}$$

so that

$$A[\lambda, u(x), \beta] = \{(\lambda_1, \lambda_2) : \lambda_1 + \lambda_2\eta_{1-\beta} \ge u(x)\}, \qquad (6.4.8)$$

where $\eta_{1-\beta}$ is such that $F(\eta_{1-\beta}) = 1 - \beta$ and is known since $F(\cdot)$ is
known. Three separate cases are investigated in detail:

1. $\lambda_2$ known: in this case the unknown parameter is $\lambda_1$ and we denote
the $(1 - \gamma)$ probability point of the posterior d.f. $B^{(1)}(\lambda_1 | x, G, m)$ of
$\lambda_1$ by $w^{(1)}(x, 1 - \gamma, G)$. Then (6.4.8) and (6.4.7) give

$$u^*(x, \beta, \gamma, G) = \lambda_2\eta_{1-\beta} + w^{(1)}(x, 1 - \gamma, G). \qquad (6.4.9)$$

2. $\lambda_1$ known: denote the $(1 - \gamma)$ probability point of the posterior d.f.
$B^{(2)}(\lambda_2 | x, G, m)$ of the unknown parameter $\lambda_2$ by $w^{(2)}(x, 1 - \gamma, G)$.
Then from (6.4.8) and (6.4.7) we get the relation

$$u^*(x, \beta, \gamma, G) = \lambda_1 + \eta_{1-\beta}\, w^{(2)}(x, 1 - \gamma, G). \qquad (6.4.10)$$

3. Both $\lambda_1$ and $\lambda_2$ unknown: the required $u^*(x, \beta, \gamma, G)$ for this case is
explicitly obtainable if the exact posterior d.f. of $\Lambda_1 + \eta_{1-\beta}\Lambda_2$ is
known. Let $w^*(x, 1 - \gamma, G)$ be the $(1 - \gamma)$ probability point of this
posterior d.f. Then (6.4.8) and (6.4.7) give the result

$$u^*(x, \beta, \gamma, G) = w^*(x, 1 - \gamma, G). \qquad (6.4.11)$$

If an explicit form of the posterior d.f. of $\Lambda_1 + \eta_{1-\beta}\Lambda_2$ is not
available, $u^*(x, \beta, \gamma, G)$ needs to be obtained by solving the
equation

$$\int_{0}^{\infty} \int_{u^*(x, \beta, \gamma, G) - \lambda_2\eta_{1-\beta}}^{\infty} b(\lambda | x, G, m)\, d\lambda_1\, d\lambda_2 = \gamma.$$
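
Case 1 can be illustrated with a small sketch. It assumes, as choices made here rather than taken from the text, that $F$ is the standard normal d.f. with known scale $\lambda_2 = \sigma$ and that the prior for the location $\lambda_1$ is normal, so that the posterior is also normal. The lower tolerance limit is then computed from (6.4.9), and the defining property (6.4.4) can be checked directly.

```python
from statistics import NormalDist

_STD = NormalDist()  # F is taken to be the standard normal d.f. (assumption)


def posterior_location(xs, sigma: float, prior: NormalDist) -> NormalDist:
    """Posterior of lambda_1 for N(lambda_1, sigma^2) data and a normal prior."""
    m = len(xs)
    post_var = 1.0 / (m / sigma**2 + 1.0 / prior.variance)
    post_mean = post_var * (sum(xs) / sigma**2 + prior.mean / prior.variance)
    return NormalDist(post_mean, post_var ** 0.5)


def percentile_cover_lower_limit(xs, sigma: float, prior: NormalDist,
                                 beta: float, gamma: float) -> float:
    """u* = sigma * eta_{1-beta} + w1(x, 1 - gamma), following (6.4.9)."""
    eta = _STD.inv_cdf(1.0 - beta)                                # F(eta) = 1 - beta
    w1 = posterior_location(xs, sigma, prior).inv_cdf(1.0 - gamma)  # (1 - gamma) point
    return sigma * eta + w1


if __name__ == "__main__":
    data = [0.3, 1.1, 0.8]
    print(percentile_cover_lower_limit(data, 1.0, NormalDist(0.0, 1.0), 0.9, 0.95))
```

By construction, the posterior probability that the coverage $1 - F\{(u^* - \lambda_1)/\sigma\}$ is at least $\beta$ equals $\gamma$.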

(a) A general result for the single-parameter case
Let the d.f. $F$ be indexed by a single parameter $\lambda$ and also let $d_c(\lambda)$ be
the $c$ probability point of $F$. The d.f. $F$ is assumed to belong to a family
such that $d_c(\lambda)$ decreases as $\lambda$ increases; this family of d.f.s has been
considered by Aitchison (1964). In this case, we have

$$A[\lambda, u^*(x), \beta] = \{\lambda : u^*(x) \le d_{1-\beta}(\lambda)\}.$$

Let $w(x, \gamma, G)$ be the $\gamma$ probability point of the posterior d.f. of $\Lambda$. Then
the following subset of the parameter space has probability $\gamma$ under
the posterior d.f. of $\Lambda$:

$$\{\lambda : w(x, \gamma, G) \ge \lambda\}.$$

However, this set is equivalent to the set given below:

$$\{\lambda : d_{1-\beta}[w(x, \gamma, G)] \le d_{1-\beta}(\lambda)\}.$$

Hence the quantity $u^*(x, \beta, \gamma, G)$ which satisfies (6.4.7) must be given
by the relation:

$$u^*(x, \beta, \gamma, G) = d_{1-\beta}[w(x, \gamma, G)].$$

6.5 EB tolerance regions

Suppose that $G$ is not known, but that results from the EB sampling
scheme described previously are available. Thus we have past
observation vectors $x_1, \ldots, x_n$ corresponding to realizations $\lambda_1, \ldots, \lambda_n$
of $\Lambda$ where $x_i = (x_{i1}, \ldots, x_{im_i})$, in addition to the current observation
vector $x$ corresponding to the current realization $\lambda$. Then $G$ can be
estimated from previous data; let $G_n$ be a consistent estimator of $G$ at
every point of $\lambda$ in the parameter space $\mathcal{L}$. Then EB analogues of
Bayesian tolerance regions can be developed in general terms by
substituting $G_n$ for $G$ in the procedures of section 6.4. As discussed in
the case of EB confidence regions, the estimated Bayesian tolerance
regions obtained in this way no longer induce the required expected
coverage or percentile coverage under the posterior d.f. of $\Lambda$. Since $G_n$
is a random entity due to variation of $x_1, \ldots, x_n$, the estimated
Bayesian tolerance regions based on $G_n$ are also random and their
properties with respect to the posterior d.f. need to be assessed
accordingly.

Let a Bayesian expected cover region satisfying (6.4.3) be denoted
by $A^*(x, \beta, G)$ and a Bayesian percentile cover region satisfying (6.4.4)
by $A^*(x, \beta, \gamma, G)$. These regions can be estimated by $A^*(x, \beta, G_n)$ and
$A^*(x, \beta, \gamma, G_n)$ respectively. For the region $A^*(x, \beta, G_n)$, the coverage
probability under the predictive density function is

$$C[A^*(x, \beta, G_n), G] = \int_{A^*(x, \beta, G_n)} dP(y | x, G), \qquad (6.5.1)$$

where $P(y | x, G)$ is the d.f. of the predictive density $p(y | x, G)$. We can
then assess the sampling properties of the coverage probability
defined in (6.5.1) as $G_n$ varies. In particular, we could again employ the
expected cover and percentile cover criteria in terms of this random
variation.


Expected cover criterion: An estimated Bayesian expected cover
tolerance region $A^*(x, \alpha, G_n)$ of size $\alpha$ is said to satisfy an expected
cover criterion of size $\beta$ if $\alpha$ satisfies the relation:

$$E_n C[A^*(x, \alpha, G_n), G] = \beta,$$

where $E_n(\cdot)$ is the expectation with respect to the variation of $G_n$.

Percentile cover criterion: An estimated Bayesian expected cover
tolerance region $A^*(x, \alpha, G_n)$ of size $\alpha$ is said to satisfy a percentile
cover criterion with size $\beta$ at level $\gamma$ if $\alpha$ satisfies the relation:

$$\Pr\nolimits_n\{C[A^*(x, \alpha, G_n), G] \ge \beta\} = \gamma.$$

Practical construction of $A^*(x, \alpha, G_n)$ is not straightforward for a
general unspecified $G$. For parametric priors a method of construction
is discussed in section 6.5.1.

A similar type of treatment can be given to the estimated Bayesian
percentile cover region $A^*(x, \beta, \gamma, G_n)$. By replacing $\gamma$ by $\alpha$ we can
consider the coverage of $A^*(x, \beta, \alpha, G_n)$ under the d.f. $F(\cdot|\lambda)$:

$$C[A^*(x, \beta, \alpha, G_n), \lambda] = \int_{A^*(x, \beta, \alpha, G_n)} dF(y|\lambda).$$

The region in $\mathcal{L}$ defined by

$$A[\lambda, A^*(x, \beta, \alpha, G_n)] = \{\lambda : C[A^*(x, \beta, \alpha, G_n), \lambda] \ge \beta\}$$

has a probability content under the posterior d.f. as given by

$$C[A[\lambda, A^*(x, \beta, \alpha, G_n)], G] = \int_{A[\lambda, A^*(x, \beta, \alpha, G_n)]} dB(\lambda | x, G, m). \qquad (6.5.2)$$

The coverage probability defined in (6.5.2) is a random quantity due
to variation of $G_n$. Thus an adjustment of $\alpha$ in the expression of
$A^*(x, \beta, \alpha, G_n)$ can be made in terms of the expected value or in terms
of the percentile value of this coverage probability. Further development
of EB analogues of percentile cover Bayesian tolerance regions
also requires specialization of the form of d.f. $F(\cdot|\lambda)$.

6.5.1 Construction of EB tolerance regions: parametric G priors

The general formulation outlined in the previous section can be
specialized to the case of a parametric $G$ prior distribution denoted by
$G(\lambda|\xi)$. Here $G$ is known up to a set of parameters $\xi$ so that we will
denote the expected cover and percentile cover Bayesian tolerance
regions by $A^*(x, \beta, \xi)$ and $A^*(x, \beta, \gamma, \xi)$ respectively to emphasize their
dependence on $\xi$. The unknown parameter vector $\xi$ can be estimated
by the ML estimator $\hat\xi_n$ based on an EB scheme as discussed in
section 2.12. Estimated Bayesian tolerance regions can then be
obtained by replacing $\xi$ by $\hat\xi_n$.

(a) EB analogues of expected cover Bayesian tolerance region
The estimated Bayesian expected cover region $A^*(x, \beta, \hat\xi_n)$ can be
assessed by looking at the sampling properties of the coverage
probability (6.5.1) under the predictive density function. Consider the
special form $A^*(x, \beta, \xi) = [u^*(x, \beta, \xi), \infty)$. With $\alpha$ in place of $\beta$, the
coverage probability defined in (6.5.1) becomes

$$C[A^*(x, \alpha, \hat\xi_n), \xi] = 1 - P(u^*(x, \alpha, \hat\xi_n) | x, \xi, m).$$

The expected value of this coverage probability with respect to
variation of $\hat\xi_n$ can be obtained exactly in principle if the exact
distribution of $\hat\xi_n$ is known. This not being the case in general, we
proceed to obtain an approximate expression for it. The technical
results here are very similar to those of section 6.2.4 with
$P(y | x, \xi, m)$ in place of $B(\lambda | x, \xi, m)$.

We can define the following derivatives which are all evaluated at
$y = u^*(x, \alpha, \xi)$:

$$P_{10} = (\partial/\partial y) P(y | x, \xi, m)$$

$$P_{20} = (\partial^2/\partial y^2) P(y | x, \xi, m)$$

$$P_{01} = (\partial/\partial\xi) P(y | x, \xi, m)$$

$$P_{11} = (\partial^2/\partial\xi\,\partial y) P(y | x, \xi, m)$$

$$P_{02} = (\partial^2/\partial\xi\,\partial\xi^T) P(y | x, \xi, m)$$

Then to terms of $O(n^{-1})$, we have

$$E_n\{C[A^*(x, \alpha, \hat\xi_n), \xi]\} = \alpha - \phi(\xi)^T P_{01} + P_{10}^{-1}\{P_{01}^T\,\Gamma(\xi)\,P_{11}\} - \tfrac{1}{2}\,\mathrm{tr}\{P_{02}\,\Gamma(\xi)\} \qquad (6.5.3)$$

where $\phi(\xi)$ and $\Gamma(\xi)$ are, as before, the bias vector and the mean
square error matrix of the ML estimator $\hat\xi_n$. We may also note here, for
computational purposes, that the required derivatives may also be
evaluated from the posterior d.f.:

$$P_{10} = E_B\{f(y|\lambda)\}$$

$$P_{11} = E_B\{f(y|\lambda)\,(\partial/\partial\xi) \ln g(\lambda|\xi)\}$$

$$P_{20} = E_B\{f(y|\lambda)\,(\partial^2/\partial\xi\,\partial\xi^T) \ln g(\lambda|\xi)\}.$$

Next, consider the construction of an EB analogue of a Bayesian
expected cover region which satisfies the percentile cover criterion.
The expected value of the coverage probability $C[A^*(x, \alpha, \hat\xi_n), \xi]$ is
given by (6.5.3). Its variance can be obtained to the same order of
approximation as

$$\mathrm{var}_n\{C[A^*(x, \alpha, \hat\xi_n), \xi]\} = P_{01}^T V(\xi) P_{01}/P_{10}^2,$$

where $V(\xi)$ is the covariance matrix of $\hat\xi_n$.

We may now follow an approach of Wald (1942) and apply a normal
approximation or follow an approach of Wilks (1941) and apply a
beta distribution approximation to the distribution of $C[A^*(x, \alpha, \hat\xi_n), \xi]$.
The results are virtually along the same lines as in section 6.2.4 and
will not be repeated here.

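(b) EB analogues of percentile cover Bayesian tolerance regions
Consider the estimated Bayesian percentile cover region $A^*(x, \beta, \gamma, \hat\xi_n)$,
where

$$A^*(x, \beta, \gamma, \hat\xi_n) = [u^*(x, \beta, \gamma, \hat\xi_n), \infty).$$

The coverage probability of this region under the d.f. $F(\cdot|\lambda)$ is given by

$$C[A^*(x, \beta, \gamma, \hat\xi_n), \lambda] = 1 - F(u^*(x, \beta, \gamma, \hat\xi_n) | \lambda).$$

Consider the region in the parameter space defined by

$$A[\lambda, u^*(x, \beta, \gamma, \hat\xi_n), \beta] = \{\lambda : C[A^*(x, \beta, \gamma, \hat\xi_n), \lambda] \ge \beta\}. \qquad (6.5.4)$$

The posterior probability of the region defined in (6.5.4) is no longer $\gamma$.
We can treat $A^*(x, \beta, \alpha, \hat\xi_n)$ as an adjustable region with respect to $\alpha$
and consider the posterior coverage probability defined in (6.5.4) with
$\alpha$ in place of $\gamma$; this is given by

$$C[A[\lambda, u^*(x, \beta, \alpha, \hat\xi_n), \beta], \xi] = \int_{A[\lambda, u^*(x, \beta, \alpha, \hat\xi_n), \beta]} dB(\lambda | x, \xi, m).$$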


6.5.2 Location and scale family F

When $F(y|\lambda)$ is of the form $F((y - \lambda_1)/\lambda_2)$, we have (6.5.4) with $\alpha$ in
place of $\gamma$ as follows:

$$A[\lambda, u^*(x, \beta, \alpha, \hat\xi_n), \beta] = \{\lambda : \lambda_1 + \eta_{1-\beta}\lambda_2 \ge u^*(x, \beta, \alpha, \hat\xi_n)\}. \qquad (6.5.5)$$

1. $\lambda_2$ known: using the expression (6.4.9), (6.5.5) becomes

$$A[\lambda, u^*(x, \beta, \alpha, \hat\xi_n), \beta] = \{\lambda_1 : \lambda_1 \ge w^{(1)}(x, 1 - \alpha, \hat\xi_n)\}.$$

In this case, adjusting $\alpha$ in terms of a preassigned value of $\gamma$ while
taking into account the random variation of $\hat\xi_n$ is the same as
obtaining an EB confidence limit for $\lambda_1$. Thus the methods discussed
in section 6.2.4 are directly applicable here.

2. $\lambda_1$ known: using the expression (6.4.10), (6.5.5) becomes

$$A[\lambda, u^*(x, \beta, \alpha, \hat\xi_n), \beta] = \{\lambda_2 : \lambda_2 \ge w^{(2)}(x, 1 - \alpha, \hat\xi_n)\}.$$

In this case again, the problem of adjusting $\alpha$ reduces to that of
obtaining a lower EB confidence limit for $\lambda_2$. Thus one can again
directly apply the methods of section 6.2.4.

3. Both $\lambda_1$ and $\lambda_2$ unknown: let $\Lambda^*$ be a new r.v. given by $\Lambda_1 +
\eta_{1-\beta}\Lambda_2$. Using the expression (6.4.11), (6.5.5) becomes

$$A[\lambda, u^*(x, \beta, \alpha, \hat\xi_n), \beta] = \{\lambda^* : \lambda^* \ge w^*(x, 1 - \alpha, \hat\xi_n)\}.$$

Again the problem of adjusting $\alpha$ is reduced to that of an
EB confidence lower limit for $\Lambda^*$ if an explicit expression for
$w^*(x, 1 - \alpha, \hat\xi_n)$ is available. Otherwise approximate solutions
need to be sought.

(a) A special single-parameter case

Consider the data d.f. $F(\cdot|\lambda)$ with the special property described in
section 6.4.3(a). For this case, (6.5.4) gives

$$A[\lambda, u^*(x, \beta, \alpha, \hat\xi_n), \beta] = \{\lambda : u^*(x, \beta, \alpha, \hat\xi_n) \le d_{1-\beta}(\lambda)\}.$$

Using the monotonicity of $d_c(\cdot)$ again, this set is equivalent to

$$\{\lambda : d_{1-\beta}[w(x, \alpha, \hat\xi_n)] \le d_{1-\beta}(\lambda)\}$$

which is again equivalent to

$$\{\lambda : w(x, \alpha, \hat\xi_n) \ge \lambda\}.$$

Hence the problem of adjusting $\alpha$ again reduces to that of finding an
EB upper confidence limit for $\lambda$ with level $\gamma$. The methods of
section 6.2.4 are again directly applicable.

6.6 Other approaches to EB interval and region estimation 

Empirical Bayes interval estimation is a fairly recent development. In
the preceding sections we have provided an EB approach by
exploiting connections with the classical problem of tolerance region
estimation. This is in the spirit of Cox (1975) and Lwin and Maritz
(1976). Technically and conceptually it differs from classical tolerance
theory in that the expected cover and the percentile cover are
computed conditionally on the current observed data. In Lwin and
Maritz (1976) such an approach is explicitly defined, but the actual
calculations for specific examples are made in terms of the sampling
variation of all the data in the EB scheme. This produces an
unconditional probability statement for the EB interval or region.
Cox’s approach is also unconditional in this sense.

The earliest development of an EB interval estimate seems to be
due to Deely and Zimmer (1969) although the first use of the term ‘EB
interval estimate’ appears to be by Cox (1975). In both of these works
the prediction interval (i.e. expected cover tolerance interval) approach
is used, with particular attention to normally distributed data.
Again, these approaches are based on unconditional probability
statements. Lord and Cressie (1975) considered prediction limits for $\lambda$
based on the regression of $\Lambda$ on $X$ in the joint distribution of $(X, \Lambda)$. A
prediction interval for a future observation on $\Lambda$ is constructed as
though $\Lambda$ is an observable r.v. with observations $\lambda_1, \lambda_2, \ldots, \lambda_k$ in an
EB scheme. The uncertainty in not knowing the parameters in the
joint distribution is allowed for in the confidence statement. Lord and
Cressie apply this method to the case when the distribution of $X$,
conditional on $\lambda$, is binomial.

Deely and Lindley (1981) provide a systematic basis for interval
estimation in the full Bayes EB approach. The relation (1.14.8) can be
used to obtain posterior limits for the current unknown parameter
since the posterior distribution is fully determined once the third stage
prior distribution function $P(\phi)$ is specified. The approach hinges on
the specification of $P(\phi)$. It is also worth noting that the advocates of
this approach are not primarily concerned with the usual sampling
properties of the intervals.

Morris (1983a, b) also gives an EB interval estimation technique.
Although the sampling properties of the intervals are of main interest,
the method is similar to the Bayes EB method. Laird and Louis (1987)
present a general bootstrap approach to the problem of EB interval
estimation, comparing their results with those of Morris.

In the following sections we give brief summaries of Morris and
bootstrap intervals, confining discussion to the structure of the FB
approach given in section 1.14.2.

6.6.1 Morris EB interval estimates

The posterior joint distribution given in (1.14.8) is recast in the form

$$B(\lambda_1, \lambda_2, \ldots, \lambda_k | x) = \int \prod_{i=1}^{k} B(\lambda_i | x_i, \phi)\, dP(\phi | x). \qquad (6.6.1)$$

In this equation $B(\lambda_i | x_i, \phi)$ is the posterior d.f. of $\lambda_i$ given the $i$th stage
data $x_i$ and the prior d.f. $G(\lambda|\phi)$. Also, $P(\phi | x_1, x_2, \ldots, x_k)$ is the
posterior distribution function of $\phi$ derived from the hyperprior
d.f. $P(\phi)$ given the data $(x_1, x_2, \ldots, x_k)$. The evaluation of
$P(\phi | x_1, x_2, \ldots, x_k)$ is often performed with $P(\phi)$ replaced by a ‘flat’
prior. The relation (6.6.1) is used by Morris to obtain approximations
for the posterior mean and variance of a specified $\lambda_j$.

From (6.6.1) the posterior marginal d.f. of a particular $\Lambda_j$ is

$$B(\lambda_j | x_1, x_2, \ldots, x_k) = \int B(\lambda_j | x_j, \phi)\, dP(\phi | x_1, x_2, \ldots, x_k). \qquad (6.6.2)$$

From (6.6.2) it is also useful to express the posterior mean and
variance of $\Lambda_j$ as

$$E(\Lambda_j | x_1, x_2, \ldots, x_k) = E_P E(\Lambda_j | x_j, \phi), \qquad (6.6.3)$$

and

$$\mathrm{var}(\Lambda_j | x_1, x_2, \ldots, x_k) = E_P\,\mathrm{var}(\Lambda_j | x_j, \phi) + \mathrm{var}_P\,E(\Lambda_j | x_j, \phi). \qquad (6.6.4)$$

In these expressions the subscript $P$ refers to the posterior d.f.
$P(\phi | x_1, x_2, \ldots, x_k)$.

Example 6.6.1 We consider the case where $F(x|\lambda)$ is $N(\lambda, \sigma^2)$ with
known $\sigma^2$, and $G$ is $N(\xi, \tau^2)$, and $m_1 = m_2 = \cdots = m_k = 1$. Morris
(1983a) gives explicit approximate expressions for the mean and
variance in (6.6.3) and (6.6.4) by employing Taylor series expansions
for $E(\Lambda_j | x_j, \phi)$ and $\mathrm{var}(\Lambda_j | x_j, \phi)$ at $\phi = E(\phi | x)$. In these calculations
the flat priors for $\xi$ and $\tau^2$ are uniform on $(-\infty, +\infty)$ and $(0, \infty)$
respectively. The following approximations are given:

$$E(\Lambda_j | x_1, x_2, \ldots, x_k) = c\bar{x} + (1 - c)x_j,$$

$$\mathrm{var}(\Lambda_j | x_1, x_2, \ldots, x_k) = (1 - c)\sigma^2 + c\sigma^2/k + 2(x_j - \bar{x})^2 c^2/(k - 3),$$

where

$$c = \frac{(k - 3)\sigma^2}{\sum_{i=1}^{k}(x_i - \bar{x})^2}.$$

The posterior distribution of $\Lambda_j$ can be approximated by a normal
distribution with mean and variance given above. Hence the EB two-sided
limits for $\lambda_j$ of size $\alpha$ are given by

$$\lambda_L, \lambda_U = E(\Lambda_j | x_1, x_2, \ldots, x_k) \pm z_{(1+\alpha)/2}\,[\mathrm{var}(\Lambda_j | x_1, x_2, \ldots, x_k)]^{1/2}. \qquad (6.6.5)$$

It is claimed that

$$\Pr\{\lambda_L < \Lambda_j < \lambda_U\} \ge \alpha,$$

where the probability is calculated according to the joint distribution
of $(X, \Lambda)$, i.e. the probability is unconditional.
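
The approximations of Example 6.6.1 translate directly into a short computation. The sketch below evaluates $c$, the approximate posterior mean and variance, and the two-sided limits using the normal quantile $z_{(1+\alpha)/2}$. Two points are assumptions added here and not part of the example: $c$ is clipped at 1 so that the variance stays positive when the observations are very close together, and the function names are illustrative.

```python
from statistics import NormalDist, fmean


def morris_moments(xs, j: int, sigma2: float):
    """Approximate posterior mean and variance of Lambda_j (Example 6.6.1)."""
    k = len(xs)
    if k <= 3:
        raise ValueError("the approximation requires k > 3")
    xbar = fmean(xs)
    ss = sum((x - xbar) ** 2 for x in xs)
    # Clipping c at 1 is an added safeguard, not part of the stated formulas.
    c = 1.0 if ss == 0.0 else min(1.0, (k - 3) * sigma2 / ss)
    mean = c * xbar + (1.0 - c) * xs[j]
    var = (1.0 - c) * sigma2 + c * sigma2 / k \
        + 2.0 * (xs[j] - xbar) ** 2 * c ** 2 / (k - 3)
    return mean, var


def morris_interval(xs, j: int, sigma2: float, alpha: float):
    """Two-sided limits: mean -/+ z_{(1+alpha)/2} * sqrt(var)."""
    mean, var = morris_moments(xs, j, sigma2)
    half_width = NormalDist().inv_cdf((1.0 + alpha) / 2.0) * var ** 0.5
    return mean - half_width, mean + half_width


if __name__ == "__main__":
    data = [1.2, -0.4, 0.9, 2.1, 0.3, 1.5, -0.8, 0.6]  # made-up observations
    print(morris_moments(data, 3, 1.0))
    print(morris_interval(data, 3, 1.0, 0.9))
```

The interval is centred on a point between $x_j$ and $\bar{x}$, and it widens for components whose observation lies far from $\bar{x}$ through the last term of the variance.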

Morris (1983b) considers extensions of this example to the case
where the data distribution of the $i$th component is $N(\lambda_i, \sigma_i^2)$. Also, the
prior mean of $\Lambda_i$ is taken to change deterministically according to

$$E_G(\Lambda_i) = a_i^T\xi$$

where $a_i$ is a vector of covariates and $\xi$ is a vector of unknown
parameters.


6.2.2 Bootstrap EB intervals 

Suppose that $N$ bootstrap samples are generated from a relevant
model, so that $N$ estimates, $\hat\phi_1, \hat\phi_2, \ldots, \hat\phi_N$, of the parameter $\phi$ of
the prior $G$ can be obtained. Methods of generating bootstrap
samples are given below. The posterior distribution of $\lambda_j$ given by
(6.6.2) can then be estimated from the bootstrap samples as

$$B(\lambda_j \mid x_1, x_2, \ldots, x_k) = \sum_{i=1}^{N} B(\lambda_j \mid x_j, \hat\phi_i)/N. \tag{6.6.6}$$

The estimated distribution in (6.6.6) can be used to obtain ‘confidence’
intervals for $\lambda_j$. For a $100\alpha\%$ two-sided interval we need to find $\lambda_L$ and
$\lambda_U$ such that

$$\int_{-\infty}^{\lambda_L} dB(\lambda_j \mid x_1, x_2, \ldots, x_k) = \frac{1 - \alpha}{2} = \int_{\lambda_U}^{+\infty} dB(\lambda_j \mid x_1, x_2, \ldots, x_k).$$

Laird and Louis (1987) suggest three types of models for producing
bootstrap samples, appropriate when there is just one observation at
each component:

1. Construct the empirical distribution function $H_k$ of the
observations $x_1, x_2, \ldots, x_k$. Then generate $N$ samples of size $k$ from this
distribution. Each of these samples is used to calculate an estimate
of $\phi$, thus creating the sequence $\hat\phi_1, \hat\phi_2, \ldots, \hat\phi_N$.

2. Using the data of the EB scheme the prior $G$ is estimated by the
nonparametric ML method described in section 2.10. Let the
estimate of $G$ be $\hat G_k$. Then perform the following operations $N$
times (i.e. for $i = 1, 2, \ldots, N$): generate a sample of $\lambda$ values
$(\lambda_{i1}^*, \ldots, \lambda_{ik}^*)$ using this $\hat G_k$. Then generate an observation $x_{ij}^*$
from $F(x \mid \lambda_{ij}^*)$ for $j = 1, 2, \ldots, k$. This creates a set of $x$ observations
which can be used to calculate an estimate of $\phi$. In this model the
form of $F$ is assumed known while $G$ is not specified.

3. This model is like 2 above except that a parametric form $G(\lambda \mid \phi)$
of $G$ is assumed. Initially the parameter $\phi$ is estimated by a
standard method such as maximum likelihood, yielding $\hat\phi$. Then
the distribution $G(\lambda \mid \hat\phi)$ is used exactly like $\hat G_k$ in 2.
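
The third (parametric) scheme, together with the mixture in (6.6.6), can be sketched as follows. This is only an illustration under stated assumptions: normal data with known variance, a normal prior with $\phi = (\mu, \tau^2)$, a simple moment estimate of $\phi$ with $\tau^2$ floored at a small positive value, and bisection to invert the averaged posterior c.d.f. None of the function names comes from Laird and Louis (1987).

```python
import random
from statistics import NormalDist, fmean

def estimate_phi(x, sigma2):
    """Moment estimate of phi = (mu, tau2); tau2 is floored to stay positive."""
    mu = fmean(x)
    tau2 = sum((xi - mu) ** 2 for xi in x) / len(x) - sigma2
    return mu, max(tau2, 1e-8)

def bootstrap_phis(x, sigma2, n_boot, rng):
    """Scheme 3: draw lambda* from G(.|phi_hat), then x* from F(.|lambda*)."""
    mu_hat, tau2_hat = estimate_phi(x, sigma2)
    phis = []
    for _ in range(n_boot):
        lam_star = [rng.gauss(mu_hat, tau2_hat ** 0.5) for _ in x]
        x_star = [rng.gauss(lam, sigma2 ** 0.5) for lam in lam_star]
        phis.append(estimate_phi(x_star, sigma2))
    return phis

def mixture_cdf(t, xj, sigma2, phis):
    """Average of the N normal posterior c.d.f.s, as in (6.6.6)."""
    total = 0.0
    for mu, tau2 in phis:
        b = sigma2 / (sigma2 + tau2)
        post = NormalDist(b * mu + (1 - b) * xj, ((1 - b) * sigma2) ** 0.5)
        total += post.cdf(t)
    return total / len(phis)

def quantile(p, xj, sigma2, phis, lo=-1e6, hi=1e6):
    """Invert the mixture c.d.f. by bisection on a wide bracket."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if mixture_cdf(mid, xj, sigma2, phis) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def bootstrap_interval(x, j, sigma2, alpha=0.9, n_boot=200, seed=0):
    """Two-sided bootstrap EB limits for lambda_j with tail area (1 - alpha)/2."""
    rng = random.Random(seed)
    phis = bootstrap_phis(x, sigma2, n_boot, rng)
    tail = (1 - alpha) / 2
    return (quantile(tail, x[j], sigma2, phis),
            quantile(1 - tail, x[j], sigma2, phis))
```

Schemes 1 and 2 differ only in how the bootstrap observations are generated; the averaging and inversion steps would be unchanged.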





CHAPTER 7 


Alternatives to empirical Bayes 


7.1 Introduction 

One view of the development of the EB approach is that it is an 
attractive compromise between the classical non-Bayes and the full 
Bayes approaches to statistical inference. These represent extremes in 
that the former uses no prior information whereas the latter requires 
complete specification of a prior distribution. The EB approach uses 
previous data to get an estimate of the prior distribution. The 
previous data and current data are linked in the form of a two-stage 
sampling scheme by a common prior distribution G of the unknown 
parameters; see section 1.8. 

The EB method is actually only one of several methods of more 
effective utilization of data from such a two-stage sampling scheme. 
Established competitors of the EB method are: compound decision 
theory, the full Bayesian multiparameter approach and a modified 
likelihood approach. These methods treat the EB scheme as a 
multivariate case where k = n + 1 variables are observed, each having 
a distribution with its own unknown parameter. The joint distribution
of the $k$ variables $X_i$, $i = 1, 2, \ldots, k$, is the product of $k$
individual distributions owing to the independence of the $X_i$’s.

Compound decision theory assumes no prior distribution but uses 
a compound loss structure, i.e. the loss in designing a decision rule is 
typically taken to be the sum of losses incurred in making decisions 
about the k parameters. The full Bayesian approach does not use 
compound loss but requires specification of a hyper-prior distribution
of the $k$ parameters.

7.2 The multiparameter full Bayesian approach 

This technique was introduced briefly by Lindley (1962) as a Bayesian
alternative to the compound decision theoretic approach in a
discussion of a paper presented by Stein (1962). A more systematic
exposition was given by Lindley (1971) for the special case when the
individual component distributions $F(x_i \mid \lambda_i)$ are $N(\lambda_i, \sigma^2)$, and there
are $m_i \ge 1$ observations on $X_i$ in the EB sampling scheme. In
section 1.14 this approach is introduced for the case $m_i = 1$. The
hyper-prior distribution used in this case is a mixture of $N(\mu, \tau^2)$
distributions with respect to a diffuse prior distribution of $\mu$. Both $\sigma^2$
and $\tau^2$ are assumed known. This construction of a hyper prior is based
on the concept of exchangeability due to de Finetti (1964). The full
Bayesian developments up to and including that in Lindley (1971) can
be deduced as special cases of a general theory developed by Lindley
and Smith (1972); a summary is given in the next section.

7.2.1 The general Bayesian linear model 

The following assumptions characterize the classes of data and prior
distributions of the general Bayesian linear model of Lindley and
Smith (1972). Let $Y$, $\theta_1$, $\theta_2$, $\theta_3$ be $N \times 1$, $p_1 \times 1$, $p_2 \times 1$, $p_3 \times 1$ vectors.
Let $A_1$, $A_2$, $A_3$, $C_1$, $C_2$, $C_3$ be $N \times p_1$, $p_1 \times p_2$, $p_2 \times p_3$, $N \times N$,
$p_1 \times p_1$, $p_2 \times p_2$ matrices, $C_1$, $C_2$, $C_3$ being positive definite. Also
assume that the conditional distributions of $Y$ given $\theta_1$, $\theta_1$ given $\theta_2$,
$\theta_2$ given $\theta_3$ are as follows:

$$Y \mid \theta_1 \text{ is } N(A_1\theta_1, C_1),$$

$$\theta_1 \mid \theta_2 \text{ is } N(A_2\theta_2, C_2), \tag{7.2.1}$$

$$\theta_2 \mid \theta_3 \text{ is } N(A_3\theta_3, C_3).$$

The variables $\theta_1$ and $\theta_2$ are unobservables, while $\theta_3$ is assumed to be
given. The matrices $A_i$ are ‘design’ matrices and are also assumed
known. The matrices $C_i$ are in general unknown parameters, but for
the first stage of the development of the model they also are assumed
known. The assumptions that $\theta_1 \mid \theta_2$ and $\theta_2 \mid \theta_3$ have distributions are
regarded as representations of exchangeability between elements
of $\theta_i$.

The main result deduced from the above specifications is that the
conditional distribution of $\theta_1$ given $y$ and $\theta_3$ is $N(Qq, Q)$, where

$$q = A_1^T C_1^{-1} y + (C_2 + A_2 C_3 A_2^T)^{-1} A_2 A_3 \theta_3, \tag{7.2.2}$$

$$Q^{-1} = A_1^T C_1^{-1} A_1 + (C_2 + A_2 C_3 A_2^T)^{-1}.$$

The full Bayesian approach regards this posterior distribution as the
final result, summarizing all data and prior information. Its mean or
its mode, these two being identical in the case considered, is regarded
as a reasonable point estimate of $\theta_1$.

When $C_3^{-1} = 0$ the posterior mean $Qq$ becomes $Q_0 q_0$ with

$$q_0 = A_1^T C_1^{-1} y,$$

$$Q_0^{-1} = A_1^T C_1^{-1} A_1 + C_2^{-1} - C_2^{-1} A_2 (A_2^T C_2^{-1} A_2)^{-1} A_2^T C_2^{-1}. \tag{7.2.3}$$

Alternatively we may write the posterior mean as

$$E(\theta_1 \mid y, \theta_3) = (A_1^T C_1^{-1} A_1 + C_2^{-1})^{-1} (A_1^T C_1^{-1} A_1 \hat\theta_1 + C_2^{-1} A_2 \hat\theta_2), \tag{7.2.4}$$

where

$$\hat\theta_1 = (A_1^T C_1^{-1} A_1)^{-1} A_1^T C_1^{-1} y,$$

$$\hat\theta_2 = \{(A_1 A_2)^T (C_1 + A_1 C_2 A_1^T)^{-1} (A_1 A_2)\}^{-1} (A_1 A_2)^T (C_1 + A_1 C_2 A_1^T)^{-1} y. \tag{7.2.5}$$

Some remarks are in order. One can interpret $C_3^{-1} = 0$ as reflecting
prior ignorance at the third stage, and when this holds we note that
the posterior mean is independent of $\theta_3$. The posterior mean,
$E(\theta_1 \mid y, \theta_3)$, is then completely determined by the design matrices and
the observed data. This is useful for application in a two-stage scheme
where the second stage has a proper probabilistic structure which
cannot be ignored. This is precisely the nature of an EB scheme, so
that the full Bayesian approach is an alternative to the EB methods
discussed in previous chapters.

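Example 7.2.1 The one-way ANOVA model. We now give an
application of the above general result to the case considered in
section 1.14. Let the vectors and matrices of the general linear
Bayesian model be specialized as follows:

$$Y^T = (x_{11}, \ldots, x_{1m_1}, x_{21}, \ldots, x_{2m_2}, \ldots, x_{k1}, \ldots, x_{km_k}),$$

$$\theta_1^T = (\lambda_1, \ldots, \lambda_k), \qquad \theta_2 = \mu,$$

$$A_1 = \begin{pmatrix} 1_{m_1} & 0_{m_1} & \cdots & 0_{m_1} \\ 0_{m_2} & 1_{m_2} & \cdots & 0_{m_2} \\ \vdots & \vdots & \ddots & \vdots \\ 0_{m_k} & 0_{m_k} & \cdots & 1_{m_k} \end{pmatrix},$$

where $1_r$ is a column vector of $r$ ones, and $0_r$ is a column vector of $r$
zeroes. Also,

$$A_2 = 1_k, \qquad C_1 = I_N \sigma^2, \qquad C_2 = I_k \tau^2,$$

where $I_r$ is an $r \times r$ identity matrix. Then we have

$$q_0 = \sigma^{-2}(m_1 \bar{x}_{1.}, \ldots, m_k \bar{x}_{k.})^T,$$

where $\bar{x}_{i.} = (x_{i1} + \cdots + x_{im_i})/m_i$ is the mean of the observations at
component $i$, and $Q_0$ is a $k \times k$ matrix with elements

$$q_{ij} = \begin{cases} \tau^2 (1 - \omega_i)\Big(1 + \sum_{r \ne i} \omega_r\Big) \Big/ \sum_{r} \omega_r, & i = j, \\[4pt] \tau^2 (1 - \omega_i)(1 - \omega_j) \Big/ \sum_{r} \omega_r, & \text{otherwise}, \end{cases}$$

with $\omega_j = m_j \tau^2/(m_j \tau^2 + \sigma^2)$. See also Smith (1973).

The full Bayes estimate of $\lambda_i$ is found to be

$$E(\lambda_i \mid y) = (m_i \bar{x}_{i.} \tau^2 + \mu^* \sigma^2)/(m_i \tau^2 + \sigma^2), \tag{7.2.6}$$

where $\mu^* = \sum_{i=1}^{k} \omega_i \bar{x}_{i.} \big/ \sum_{i=1}^{k} \omega_i$. The posterior variance of $\lambda_i$ is $q_{ii}$ given
above. When every $m_i = 1$ these results reduce to the corresponding
expressions in section 1.14.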
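
The quantities in Example 7.2.1 can be computed without forming any matrices. The sketch below assumes known $\sigma^2$ and $\tau^2$ and takes $\omega_i = m_i\tau^2/(m_i\tau^2 + \sigma^2)$, the weights for which $Q_0 q_0$ reproduces (7.2.6); the function name and the list-of-lists data layout are illustrative choices only.

```python
def full_bayes_anova(groups, sigma2, tau2):
    """Posterior means (7.2.6) and variances q_ii for the one-way model."""
    m = [len(g) for g in groups]
    means = [sum(g) / len(g) for g in groups]
    # omega_i: weight given to the ith component mean
    w = [mi * tau2 / (mi * tau2 + sigma2) for mi in m]
    sw = sum(w)
    # mu*: omega-weighted average of the component means
    mu_star = sum(wi * xb for wi, xb in zip(w, means)) / sw
    post_mean = [wi * xb + (1 - wi) * mu_star for wi, xb in zip(w, means)]
    # q_ii = tau2 (1 - omega_i)(1 + sum_{r != i} omega_r) / sum_r omega_r
    post_var = [tau2 * (1 - wi) * (1 + sw - wi) / sw for wi in w]
    return post_mean, post_var
```

Writing the posterior mean as $\omega_i \bar{x}_{i.} + (1 - \omega_i)\mu^*$ is algebraically the same as (7.2.6) and makes the shrinkage towards $\mu^*$ explicit.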


Example 7.2.2 Ridge regression. Suppose we specialize the general
Bayesian linear model as follows:

$A_1$ is an $n \times p$ matrix $X$ of design variables
$A_2$ is a $p \times p$ identity matrix $I_p$
$C_1 = I_n \sigma^2$
$C_2 = I_p \tau^2$
$\theta_1$ is a $p \times 1$ random vector $\beta$
$\theta_2$ is a $p \times 1$ random vector $1\xi$, where $\xi$ is a random variable.

Then we have an $n \times 1$ data vector $Y$ with $N(X\beta, I_n \sigma^2)$ distribution,
while $\beta$ has a $N(1\xi, I_p \tau^2)$ distribution, and the r.v. $\xi$ represents
third-stage prior ignorance. From (7.2.3) we get

$$q_0 = X^T Y/\sigma^2,$$

$$Q_0^{-1} = \{(X^T X)/\sigma^2 + I_p/\tau^2 - (p\tau^2)^{-1} J_p\},$$

where $J_p$ is a $p \times p$ matrix all of whose elements have the value 1.
Hence we get

$$E(\beta \mid Y = y) = \{I_p + (X^T X)^{-1}(I_p - J_p/p)\sigma^2/\tau^2\}^{-1} \hat\beta, \tag{7.2.7}$$

where $\hat\beta$ is the usual least squares estimator of $\beta$. The result (7.2.7) is
the full Bayesian analogue of the ridge regression model proposed in a
non-Bayes context by Hoerl and Kennard (1970).

The assumption that the components of $\beta$ are independent r.v.s
with common mean and variance is interpreted as a representation of
exchangeability within the multiple regression equations. An alternative
Bayesian model is to take only a two-stage version of the general
linear Bayesian model with the same special forms of the vectors and
matrices above but $\xi = 0$ a constant. This gives the estimate

$$\beta^+ = \{I_p + (X^T X)^{-1} \sigma^2/\tau^2\}^{-1} \hat\beta, \tag{7.2.8}$$

which is of the same form as the ridge regression estimator.
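
To see how (7.2.7) and (7.2.8) differ in practice, both can be rewritten as solutions of penalized normal equations, $(X^T X + \kappa M)\beta = X^T y$ with $\kappa = \sigma^2/\tau^2$, where $M = I_p - J_p/p$ for (7.2.7) and $M = I_p$ for (7.2.8). The sketch below is restricted to $p = 2$ so that it needs only the standard library; the function names and the argument `ratio` (standing for $\kappa$) are hypothetical.

```python
def solve2(a, b):
    """Solve the 2 x 2 linear system a z = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def shrunken_coefficients(xtx, xty, ratio, exchangeable_mean=True):
    """Posterior mean of beta for p = 2; ratio = sigma^2 / tau^2.

    exchangeable_mean=True  -> (7.2.7): shrink towards a common value
    exchangeable_mean=False -> (7.2.8): shrink towards zero (ridge form)
    """
    p = 2
    offset = 1.0 / p if exchangeable_mean else 0.0
    # Penalty matrix ratio * (I - J/p) or ratio * I
    penalty = [[ratio * ((1.0 if i == j else 0.0) - offset)
                for j in range(p)] for i in range(p)]
    lhs = [[xtx[i][j] + penalty[i][j] for j in range(p)] for i in range(p)]
    return solve2(lhs, xty)
```

With $X^T X = 2I_2$, $X^T y = (2, 6)^T$ and `ratio = 2`, the least squares estimate $(1, 3)$ is pulled to $(1.5, 2.5)$ under (7.2.7), which preserves the average coefficient, and to $(0.5, 1.5)$ under (7.2.8).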

Lindley and Smith (1972) give an example where the assumption of 
exchangeability may be realistic. It is in an educational testing context 
where the p regressor variables might be the results of p tests applied 
to students, and the dependent variable Y a measure of the students’ 
performance after training. Individual regression coefficients may 
then be regarded as exchangeable after a rescaling of the regressor 
variables so that X T X becomes a correlation matrix. 

As introduced above, the full Bayes approach deals with simultaneous
estimation of all of the elements of $\theta_1$, the ‘first-stage’
parameters of the general linear model. In the standard EB framework
the different elements of $\theta_1$ represent different components of
the EB scheme, and usually the last element of $\theta_1$ is of interest as the
current parameter. The full Bayes approach, as presented, requires
specification of parametric forms of the priors and hyper priors; in
particular it seems that it is strongly dependent on normality
assumptions. However, a later development by Deely and Lindley
(1981) extends it to general non-normal data and prior distributions.
This development, which has been termed Bayes empirical Bayes,
considers estimation of only a single parameter, at the $(n + 1)$th
component of the EB scheme. The transition to the simultaneous
estimation of all parameters in the $k = n + 1$ components is then
straightforward.

7.2.2 Bayes empirical Bayes

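Consider the EB scheme as in (1.8.1). Following Deely and Lindley
(1981) we specify

$f(x_i \mid \lambda_i)$, the conditional p.d.f. of $X_i \mid \lambda_i$, $i = 1, 2, \ldots, k$,

$g(\lambda \mid \theta)$, the conditional prior p.d.f. of $\lambda \mid \theta$,

$a(\theta \mid \phi)$, the conditional hyper-prior p.d.f. of $\theta \mid \phi$.

As before, the assumptions of p.d.f.s $g(\lambda \mid \theta)$ and $a(\theta \mid \phi)$ are justified by
exchangeability. Putting

$$C(x, \lambda, \theta, \phi) = \Big\{\prod_{i=1}^{k} f(x_i \mid \lambda_i)\Big\}\, g(\lambda \mid \theta)\, a(\theta \mid \phi),$$

the posterior p.d.f. of $\lambda$ given $x_1, x_2, \ldots, x_k$ and $\phi$ is

$$dB(\lambda \mid x_1, x_2, \ldots, x_k; \phi) = \frac{\int C(x, \lambda, \theta, \phi)\, d\theta}{\iint C(x, \lambda, \theta, \phi)\, d\theta\, d\lambda}.$$

The integrals appearing in the above expression cannot readily be
evaluated when the distributions involved are not normal. Deely and
Lindley (1981), following Lindley (1961), used approximations in
terms of the derivatives of $a(\theta \mid \phi)$ and of the ‘marginal’ p.d.f.

$$h(x \mid \theta) = h(x_1, x_2, \ldots, x_k \mid \theta) = \int \Big\{\prod_{i=1}^{k} f(x_i \mid \lambda_i)\Big\}\, g(\lambda \mid \theta)\, d\lambda$$

to derive the result

$$dB(\lambda \mid x_1, x_2, \ldots, x_k, \phi) \simeq b(\lambda \mid x_1, x_2, \ldots, x_k, \hat\theta),$$

where

$$b(\lambda \mid x_1, x_2, \ldots, x_k, \theta) = \{h(x \mid \theta)\}^{-1} \prod_{i=1}^{k} f(x_i \mid \lambda_i)\, g(\lambda \mid \theta)$$

is the posterior p.d.f. of $\lambda$ given $x_1, x_2, \ldots, x_k$ and $\theta$, and $\hat\theta$ is the
estimate of $\theta$ obtained by maximizing $h(x \mid \theta)$ w.r.t. $\theta$.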

The result above is interesting for two reasons. First, it is a
practically applicable result which requires no knowledge of hyper
priors or their parameters. Second, it gives a justification for the
standard EB procedures with parametric prior $g(\lambda \mid \theta)$ as an
approximation to the full Bayesian solution.
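
A small non-normal case shows how the plug-in posterior $b(\lambda \mid x, \hat\theta)$ is used. The sketch below assumes, purely for illustration, Poisson counts with an exponential prior $g(\lambda \mid \theta) = \theta e^{-\theta\lambda}$; this pairing is not taken from Deely and Lindley (1981). Under it each count has the geometric marginal $h(x \mid \theta) = \theta/(1 + \theta)^{x+1}$, the maximizing value is $\hat\theta = 1/\bar{x}$, and the posterior of $\lambda_i$ given $x_i$ and $\theta$ is a gamma distribution with shape $x_i + 1$ and rate $1 + \theta$.

```python
from statistics import fmean

def bayes_eb_poisson(counts):
    """Plug-in posterior means for Poisson data with an exponential prior.

    Prior g(lambda | theta) = theta * exp(-theta * lambda); the marginal of
    each count is geometric, h(x | theta) = theta / (1 + theta) ** (x + 1).
    """
    xbar = fmean(counts)
    if xbar == 0:
        raise ValueError("all counts are zero; theta_hat is unbounded")
    theta_hat = 1.0 / xbar          # maximizes prod_i h(x_i | theta)
    # Posterior of lambda_i given x_i, theta: Gamma(x_i + 1, rate 1 + theta)
    post_means = [(x + 1) / (1 + theta_hat) for x in counts]
    return theta_hat, post_means
```

For counts 0, 2 and 4 this gives $\hat\theta = 0.5$ and posterior means $2/3$, $2$ and $10/3$; no hyper prior $a(\theta \mid \phi)$ enters the calculation, in line with the first remark above.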

7.3 Likelihood-based approaches

7.3.1 A modified likelihood approach 

A brief introduction to a ‘likelihood type’ approach to the EB scheme
is given in section 1.14. Recapitulating, the main points are:

1. The pairs $(x_i, \lambda_i)$, $i = 1, 2, \ldots, k$, of the EB scheme (1.8.1) are regarded
as independent realizations of r.v. pairs $(X_i, \Lambda_i)$ with joint p.d.f.
$f(x_i \mid \lambda_i)\, g(\lambda_i \mid \theta)$.

2. The joint p.d.f. of all pairs is

$$L(x, \lambda) = \prod_{i=1}^{k} f(x_i \mid \lambda_i)\, g(\lambda_i \mid \theta) \tag{7.3.1}$$

and it is regarded as a ‘likelihood function’ for the unobservables
$\lambda_i$, $i = 1, 2, \ldots, k$.

3. The marginal p.d.f. of the $x_i$’s, given by

$$h_k(x \mid \theta) = \int L(x, \lambda)\, d\lambda,$$

is regarded as a ‘likelihood function’ for the parameter $\theta$.

4. The function $L(x, \lambda)$ is used to obtain a ‘likelihood’ estimate of $\lambda$ in
terms of $x$ and $\theta$; let $\hat\lambda(x, \theta)$ be this estimate.

5. The ‘likelihood’ $h_k(x \mid \theta)$ is used to obtain an estimate $\hat\theta$ of $\theta$.

6. The estimate for $\lambda$ is finally given as $\hat\lambda(x, \hat\theta)$.

The theory behind the above approach at first seems to be somewhat 
different from that behind the standard EB procedure. In the above 
form it seems to have appeared first in Nelder (1972) and Finney 
(1974). As will be shown below, it is in agreement with the standard 
EB method for the parametric G case. The following example is from 
Finney (1974). 

Example 7.3.1 Suppose that $\mathbf{X}_i$ is the single random variable $X_i$, and
that the p.d.f. $f(x_i \mid \lambda_i)$ is $N(\lambda_i, \sigma^2)$. Suppose also that $g(\lambda_i \mid \theta)$ is $N(\mu, \tau^2)$;
note that $\theta = (\mu, \tau^2)^T$. The logarithm of the modified likelihood
function $L(x, \lambda)$ is then

$$\text{Constant} - \sum_{i=1}^{k} (x_i - \lambda_i)^2/(2\sigma^2) - \sum_{i=1}^{k} (\lambda_i - \mu)^2/(2\tau^2).$$

Maximization of this function leads to the estimates

$$\hat\mu = \bar{x} = \sum_{i=1}^{k} x_i/k,$$

$$\hat\lambda_i = \bar{x} + \{1 - \sigma^2/(\sigma^2 + \tau^2)\}(x_i - \bar{x}).$$

Thus the modified likelihood approach leads to the usual EB
estimate. It should be noted that the approach in this example differs
slightly from the general outline given above in that the parameter $\mu$
of the prior distribution is also estimated using the likelihood
function.
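
The estimates of Example 7.3.1 can be checked numerically. The sketch below treats $\sigma^2$ and $\tau^2$ as known inputs, as the example does, and also exposes the log modified likelihood so that the stationary point can be verified by perturbation; the function names are illustrative.

```python
from statistics import fmean

def modified_likelihood_estimates(x, sigma2, tau2):
    """Joint maximizers of ln L(x, lambda) over mu and the lambda_i."""
    mu_hat = fmean(x)
    shrink = 1 - sigma2 / (sigma2 + tau2)
    return mu_hat, [mu_hat + shrink * (xi - mu_hat) for xi in x]

def log_modified_likelihood(x, lam, mu, sigma2, tau2):
    """ln L(x, lambda) up to an additive constant."""
    return (-sum((xi - li) ** 2 for xi, li in zip(x, lam)) / (2 * sigma2)
            - sum((li - mu) ** 2 for li in lam) / (2 * tau2))
```

For $x = (1, 2, 6)$, $\sigma^2 = 1$ and $\tau^2 = 3$ the shrinkage factor is $0.75$, so the estimates are $\hat\mu = 3$ and $\hat\lambda = (1.5, 2.25, 5.25)$; moving any $\hat\lambda_i$ away from these values lowers the log modified likelihood.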

Questions have arisen regarding the modified likelihood approach,
mainly directed at steps 2 and 4 above. Can one justify the use of
$L(x, \lambda)$ to construct estimates of $\lambda$? What are the properties of such
estimates? These questions are prompted by the fact that $L(x, \lambda)$ is not
a likelihood in the usual sense because $\lambda$ is an unobservable random
variable.

A justification can be given in terms of distance measures between
distributions. In particular, the Kullback-Leibler criterion can be
employed. The distance between the empirical distribution of the $\lambda_i$
and their theoretical distribution is minimized if $\sum \ln g(\lambda_i \mid \theta)$ is
maximized. We can regard $\lambda$ as a ‘point’ intermediate between the
empirical distribution of the $x_i$’s and the theoretical distribution of
the $\lambda_i$’s. We can then regard the total distance between the empirical
distribution of the $x_i$’s and the theoretical distribution of the $\lambda_i$’s as
being minimized when we maximize

$$\sum_{i=1}^{k} \ln f(x_i \mid \lambda_i) + \sum_{i=1}^{k} \ln g(\lambda_i \mid \theta),$$

which is just the logarithm of $L(x, \lambda)$ in (7.3.1).

7.3.2 Empirical regression estimation 

The joint p.d.f. of observables and unobservables introduced in
section 7.3.1 can be refactored as

$$L(x, \lambda) = b(\lambda \mid x, \theta)\, h_k(x \mid \theta),$$

where the entities on the right side are defined as in section 7.2.2.
Under regularity conditions, including that the range of $f(\cdot \mid \lambda_i)$ be
independent of $\lambda_i$, maximizing $L(x, \lambda)$ amounts to maximizing the
posterior p.d.f. $b(\lambda \mid x, \theta)$ with respect to $\lambda$. Hence step 4 of
section 7.3.1 is equivalent to obtaining the posterior mode of $\lambda$. One
can therefore consider replacing step 4 by a more general procedure
such as:

4a. Obtain a summary of $b(\lambda \mid x, \theta)$ such as the posterior mode, mean
or a percentile. Each of these will in general be a function of $x$
and $\theta$.

Using the posterior mean in 4a leads to an alternative modified
likelihood approach which is equivalent to a standard EB approach
with quadratic loss function. This alternative modified likelihood
approach appeared in the literature of genetic selection in a paper by
Fairfield-Smith (1936). Developments along these lines were
re-examined and given prominence by Rao (1975), who referred to this
approach as empirical regression estimation (ERE). The use of the
posterior mean in 4a is justified by a regression construction, i.e.
regression of the unobservable $\lambda$ on the observations $x$. Apart from
this change, the ERE is a modified likelihood procedure, and follows
the steps 1-4 in section 7.3.1. Thus it may also be regarded as a special
case of the EB method. Further developments of the ERE approach
have been restricted mainly to estimates that are linear in the
observations. This has the advantage that only first- and second-order
moments need to be specified or estimated in practical
applications. Indeed, the early ERE methods can be regarded as
forerunners of the linear EB methods described in Chapters 3 and 4.
In the following subsections a brief account of early developments in
ERE methods is given with special reference to examples that were
examined in this area. Material is drawn largely from Rao (1975).

(a) A general linear model for the ERE approach

The basic model

Consider the vector random variables $\lambda$ and $y$, where
$\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_k)$, $y = (y_1, y_2, \ldots, y_k)$ and let $y_i = \lambda_i + e_i$, $i = 1, 2, \ldots, k$.
The $e_i$’s have a distribution $F$ such that $E_F(e) = 0$ and $\operatorname{cov}_F(e) = \sigma^2 V$,
$V$ being a known $k \times k$ matrix. The $\lambda_i$’s are the parameters of interest
and they are assumed to be generated by a prior distribution $G$. Also,
put $\lambda = \phi + \eta$, where $\phi$ is an unknown parameter and $\eta$ is a random
vector with $E_G(\eta) = 0$ and $\operatorname{cov}_G(\eta) = \tau^2 W$, where $W$ is a known
matrix.

We shall be concerned with the regression of $\lambda$ (or $\theta$) on $Y$ for fixed
values of $\sigma^2$, $\tau^2$ and $\phi$ (or $\beta$) and note that

$$E(\lambda \mid Y = y) = \phi + \tau^2 W(\tau^2 W + \sigma^2 V)^{-1}(y - \phi).$$

This follows from the results on linear Bayes estimation in section 4.4.

A prediction model for genetic breeding ability

A variant of the basic model is the case when $\phi$ depends linearly on a
set of covariates. This model has been used in genetics, and elsewhere,
and is specified by the further relation $\phi = Z\beta$. Here $Z$ is a $k \times s$ matrix
of known elements representing covariates, and $\beta$ is a vector of
unknown parameters. See also section 4.7 for a discussion of
concomitant variables in the EB framework. The problem in the genetic
context is to obtain a predictor for $\eta$ which is regarded as the true
animal breeding ability.

The regression of $\eta$ on $Y$ is given by

$$E(\eta \mid Y = y) = \tau^2 W(\tau^2 W + \sigma^2 V)^{-1}(y - Z\beta).$$

The Gauss-Markov model with random coefficients

Another variant of the basic model is obtained when $\lambda = Z\theta$, and
$\theta$ is a realization of a random variable $\Theta$ such that $E(\Theta) = \beta$ and
$\operatorname{cov}(\Theta) = \tau^2 \Gamma$. In this formulation $\beta$ is an $s \times 1$ vector of
unknown parameters, $\tau^2$ is an unknown parameter and $\Gamma$ is a
known $s \times s$ matrix. The regression of $\Theta$ on $Y$ is given by

$$E(\Theta \mid Y = y) = \beta + \tau^2 \Gamma\{\tau^2 \Gamma + \sigma^2 (Z^T V^{-1} Z)^{-1}\}^{-1}(\hat\theta - \beta),$$

where

$$\hat\theta = (Z^T V^{-1} Z)^{-1} Z^T V^{-1} Y$$

is the usual weighted least squares estimate of $\theta$. This result follows
from the application of the linear Bayes method to $\hat\theta$ whose sampling
distribution has the moments

$$E(\hat\theta \mid \theta) = \theta, \qquad \operatorname{cov}(\hat\theta \mid \theta) = (Z^T V^{-1} Z)^{-1} \sigma^2.$$

(b) Empirical regression estimates 

The regression functions obtained in subsection (a) above are in terms
of parameters $\sigma^2$, $\tau^2$ and $\phi$, or $\beta$. These parameters are generally not
known, and in order to estimate them appropriate sampling schemes
and data are needed.

Basic model

Consider the EB scheme (1.8.1) where $m_i$ observations are made on $X_i$
at the $i$th component. Now let $Y^T = (\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_k)$. Its $i$th element
$\bar{X}_i = (X_{i1} + X_{i2} + \cdots + X_{im_i})/m_i$ can be written as $\bar{X}_i = \lambda_i + e_i$,
$i = 1, 2, \ldots, k$. The r.v. $e$ has the mean and variance structure of the
basic model with $V$ a $k \times k$ diagonal matrix with $i$th diagonal element
$1/m_i$. Since the $\lambda_i$’s are independent realizations of $\Lambda$ with mean $\mu_G$ we
can take $\phi$ in the basic model to be the $k \times 1$ vector all of whose
elements have the value $\mu_G$. The covariance matrix of $\eta$ is $\tau^2 I_k$. The
parameters $\mu_G$ and $\tau^2$ can be estimated by

$$\hat\mu_G = \sum_{i=1}^{k} \bar{x}_i/k,$$

$$\hat\tau^2 = \sum_{i=1}^{k} (\bar{x}_i - \hat\mu_G)^2/k - \hat\sigma^2 \Big\{\sum_{i=1}^{k} (1/m_i)/k\Big\},$$

where $\hat\sigma^2$ is the usual ‘within-groups’ sample variance. The ERE of
$\lambda$ is now obtained as

$$\hat\lambda(\mathrm{ERE}) = \hat\mu_G 1 + \hat\tau^2(\hat\tau^2 I + \hat\sigma^2 V)^{-1}(y - \hat\mu_G 1).$$

An application of this ERE is found in a genetic selection problem.
Suppose that $\lambda$ is a vector of genetic variables and that a linear
function, $a^T \lambda$, for given known $a$ is a genetic value of an individual.
Then a good index for selecting and comparing individuals with
respect to the genetic value is $a^T \hat\lambda(\mathrm{ERE})$. A more general treatment of
this type of genetic selection problem is given by Rao (1977).
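
Because $V$ is diagonal and $W = I$ in the basic model, the ERE reduces to a separate shrinkage weight $\tau^2/(\tau^2 + \sigma^2/m_i)$ for each component. The following sketch makes that explicit. It rests on several assumptions that are not in the text: the within-groups variance is pooled over components with at least two observations, $\hat\tau^2$ is the moment estimate centred at $\hat\mu_G$ and truncated at zero, and the function name is hypothetical.

```python
from statistics import fmean, variance

def ere_estimates(groups):
    """Empirical regression estimates, basic model, V = diag(1/m_i), W = I."""
    k = len(groups)
    m = [len(g) for g in groups]
    y = [fmean(g) for g in groups]                 # component means
    # Pooled within-groups variance; needs some group with m_i >= 2.
    df = sum(mi - 1 for mi in m)
    sigma2 = sum((mi - 1) * variance(g)
                 for mi, g in zip(m, groups) if mi > 1) / df
    mu = fmean(y)
    tau2 = (sum((yi - mu) ** 2 for yi in y) / k
            - sigma2 * fmean(1 / mi for mi in m))
    tau2 = max(tau2, 0.0)                          # truncation: an added safeguard
    lam = []
    for mi, yi in zip(m, y):
        denom = tau2 + sigma2 / mi
        weight = tau2 / denom if denom > 0 else 1.0
        lam.append(mu + weight * (yi - mu))
    return mu, sigma2, tau2, lam
```

A selection index of the kind described above is then the inner product of a known vector $a$ with the returned list of estimates.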

Prediction of genetic breeding ability

Consider the model for prediction of genetic breeding ability,
$\eta = \lambda - Z\beta$, as before. We assume that the covariance matrix of $\eta$ is $\tau^2 I_k$.
Using $\bar{x}_i$ in the place of $y_i$, the regression of $\eta$ on $y$ can be expressed as

$$E(\eta \mid Y = y) = \{\tau^2/(\sigma^2 + \tau^2)\}(y - Z\beta).$$

Given an EB scheme as in (1.8.1) $\sigma^2$ can be estimated as in the previous
section. Adjusting for the covariates, $\sigma^2 + \tau^2$ can be estimated by
$(y - Z\hat\beta)^T (y - Z\hat\beta)/(k - r)$, where $r$ is the rank of $Z$, and
$\hat\beta = (Z^T V^{-1} Z)^{-1} Z^T V^{-1} y$.


The Gauss-Markov model with random coefficients

The parameters $\beta$ and $\tau^2$ are to be estimated when the prior
distribution is unknown. Suppose that $\beta = 1\gamma$. Then, based on $y$, with
$y_i = \bar{x}_i$, moment estimates for $\gamma$ and $\tau^2$ are given in Rao (1975) as

$$\hat\gamma = 1^T (Z^T V^{-1} Z)\hat\theta \big/ 1^T (Z^T V^{-1} Z) 1$$
f 2 = (0 3- lmz'v- l zy 1 y)/(r - 3) - & 2 }/D, 

where 

D = [tr(Z T V _1 Z) _1 r 

- {1 T (Z T V- ‘Z)- T(Z T V- 1 Z)- 1 1}/{1 T (Z T V-»Z)- l ]/(r - 1)}, 

and $\hat\sigma^2$ is the usual estimate of the error variance,

$$\hat\sigma^2 = (Y - Z\hat\theta)^T V^{-1} (Y - Z\hat\theta)/(k - r + 2).$$

It should be noted here that $\hat\tau^2$ and $\hat\sigma^2$ are not unbiased estimates of $\tau^2$
and $\sigma^2$.

The situation above is the same as the exchangeability within the
regression coefficients as considered by Lindley and Smith (1972).
Thus the ERE for $\theta$ obtained in this way is an analogue of the ridge
regression estimate of Hoerl and Kennard (1970), with shrinkage to $\gamma$
instead of the origin of the regression coefficients.

When $\beta$ has no structure previous data are needed in the form

$$y_1, y_2, \ldots, y_n, y,$$

where $y_i$ is generated by a realization $\theta_i$ of $\Theta$. This case has been
treated by Rao (1975). A related more general multivariate linear case
is considered by Efron and Morris (1972). The data structure now is
such that one is dealing with different linear models whose parameters
are estimated simultaneously.


7.4 Compound estimation and decision theory 

This is another alternative to EB techniques, which could be regarded 
as a stimulus for the creation of the EB approach. Originated by 
Robbins (1951) before the introduction of his EB approach in 
Robbins (1955), compound decision theory deals with the same 
sampling scheme (1.8.1). The difference now is that no a priori
distribution is assumed as generating the parameter values
$\lambda_1, \lambda_2, \ldots, \lambda_k$. In introducing compound decision theory Robbins
(1951) considered the problem of decision between two simple
hypotheses, assuming normal data distributions. Robbins did not
proceed to point estimation, which was taken up by Stein (1955) and
James and Stein (1961), who produced the James-Stein estimator.
Much further literature has since been devoted to both compound
estimation and decision. Our present purpose is not to give an 
extensive review of the subject, but only to give an outline of the basic 
ideas, tracing the connection with EB methods. In particular, the aim 
is to reinforce the main theme that following the techniques suggested 
by the EB approach provides one route by which compound 
estimators or decision rules can be developed. 

7.4.1 Compound estimation 

(a) Component and compound decision problems

In section 1.14 a linear compound estimator for simultaneous
estimation of the $k$ parameters $\lambda_i$, $i = 1, 2, \ldots, k$, in the EB scheme was
introduced. The key to the construction of a compound estimator is in
minimizing a compound risk function rather than a component risk
function. The special case dealt with in section 1.14 assumes that the
observation $x_i$ at the $i$th component is generated by a $N(\lambda_i, \sigma^2)$
distribution. Essentially the same technique can be used in a more
general setting without assumption of a prior distribution.

Suppose that $X_1, X_2, \ldots, X_k$ are independent r.v.s and that for
$i = 1, 2, \ldots, k$ the distribution function of $X_i$ is $F(x_i \mid \lambda_i)$, depending on
the parameter $\lambda_i$. Suppose also that $m_i \ge 1$ independent observations,
$x_{ij}$, $j = 1, 2, \ldots, m_i$, are made on $X_i$. Let $\mathbf{x}_i$ be the vector whose
elements are $x_{ij}$, $j = 1, 2, \ldots, m_i$; if $m_i = 1$, $\mathbf{x}_i = x_i$. An estimator is
sought for the vector $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_k)$.

Since the $X_i$ are mutually independent, any two estimation
problems concerning $\lambda_i$ and $\lambda_j$, $i \ne j$, can be regarded as unrelated.
Hence it may be thought that an estimator of $\lambda_i$ should depend on $\mathbf{x}_i$
alone and be of the form

$$\hat\lambda_i^{(0)} = \delta^{(0)}(\mathbf{x}_i). \tag{7.4.1}$$

However, for the vector $\lambda$ the simple estimator $\hat\lambda^{(0)}$ whose elements
are $\hat\lambda_i^{(0)}$, $i = 1, 2, \ldots, k$, is not necessarily optimal in terms of compound
risk. The compound risk function is

$$R^{(k)}(\hat\lambda, \lambda) = E_D L^{(k)}(\hat\lambda, \lambda), \tag{7.4.2}$$

where $D$ is the distribution of $X$ and $E_D$ is expectation w.r.t. $D$; the
compound loss $L^{(k)}(\hat\lambda, \lambda)$ is the average of the component losses
$L_i(\hat\lambda_i, \lambda_i)$, $i = 1, 2, \ldots, k$. The expected value w.r.t. $D$ of the component
loss is called the component risk. A typical choice for the form of
component loss is $L_i(\hat\lambda_i, \lambda_i) = (\hat\lambda_i - \lambda_i)^2$ so that

$$L^{(k)}(\hat\lambda, \lambda) = \sum_{i=1}^{k} (\hat\lambda_i - \lambda_i)^2/k. \tag{7.4.3}$$

The best estimator of $\lambda$ may be such that the estimator of $\lambda_i$ depends
not only on $\mathbf{x}_i$ but also on $\mathbf{x}_j$ for $j \ne i$. In general the estimator of $\lambda_i$
should be of the form

$$\hat\lambda_i^{+} = \delta_i(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k) = \delta_i(x). \tag{7.4.4}$$

In (7.4.4) it is assumed that the decisions are to be made after all
observations on each variable have been made. A special class of rules
of the type of (7.4.4) is the symmetric rules satisfying also

$$\delta_i(qx) = q\delta_i(x), \qquad q \in Q, \tag{7.4.5}$$

where $Q$ is the set of permutations of the integers $1, 2, \ldots, k$. The rule
defined in (7.4.1) belongs to the class of symmetric rules.

Rules of the type $\hat\lambda^{(0)}$ are called non-compound so as not to confuse
terminology with that of simple EB rules, and those of (7.4.5) are
symmetric compound rules. A non-compound rule can be obtained
by using any conventional estimate of $\lambda_i$ in the place of $\delta^{(0)}(\mathbf{x}_i)$. On the
other hand, compound rules are obtained by direct consideration of
the compound loss. Special forms of $\hat\lambda^{+}$ have been motivated by the
fact that the ‘best’ symmetric rule is often a non-compound rule being
a functional of a symmetric function $g(\lambda)$ of $\lambda$. If it is possible to
estimate $g(\lambda)$ from the entire data set, replacing it by the estimate
produces a compound estimate. To illustrate, we consider a special
class of estimators in the next section.

(b) Optimal linear estimators

Consider estimators of $\lambda_i$ of the form

$$\hat\lambda_i = a_{0i} + \sum_{j=1}^{m_i} a_{ji} X_{ij}. \tag{7.4.6}$$

We shall determine the constants $a_{ji}$ in such a way as to minimise the
compound risk

$$R^{(k)}(\hat\lambda, \lambda) = E_D\, k^{-1} \sum_{i=1}^{k} \Big\{\Big(a_{0i} + \sum_{j=1}^{m_i} a_{ji} X_{ij}\Big) - \lambda_i\Big\}^2.$$

Straightforward calculations show that

$$a_{ji} = \sigma_\lambda^2/(m_i \sigma_\lambda^2 + \omega_x), \qquad j = 1, 2, \ldots, m_i,$$

$$a_{0i} = \omega_x \bar\lambda/(m_i \sigma_\lambda^2 + \omega_x),$$

where $E(X_i \mid \lambda_i) = \lambda_i$, $\operatorname{var}(X_i \mid \lambda_i) = \sigma_i^2$, $\bar\lambda = \sum_{i=1}^{k} \lambda_i/k$, $\omega_x = \sum_{i=1}^{k} \sigma_i^2/k$,
and $\sigma_\lambda^2 = \sum_{i=1}^{k} (\lambda_i - \bar\lambda)^2/k$. Thus the optimal linear estimator is defined
by

$$\lambda_i^* = \bar\lambda + (1 - c_{x_i})(\bar{X}_i - \bar\lambda), \tag{7.4.7}$$

where

$$\bar{X}_i = \sum_{j=1}^{m_i} X_{ij}/m_i, \qquad c_{x_i} = \omega_x/(m_i \sigma_\lambda^2 + \omega_x).$$

We note that $\lambda_i^*$ is a non-compound rule and a function of $\bar\lambda$ and the
$c_{x_i}$ which are symmetric functions of the $\lambda_i$. The quantities $\bar\lambda$ and $c_{x_i}$ are
generally unknown in practice, but they can be estimated using the
full set of data, as we show below. Replacing these unknowns by their
estimates produces a compound estimator, because each $\lambda_i^*$ is
replaced by a quantity depending not only on the observations in
component $i$ but also on all other observations.


(c) Linear compound estimates of means

A special structure for $\sigma_i^2$

We now assume that

$$\sigma_i^2 = (A - 1)\lambda_i^2 + B\lambda_i + C, \tag{7.4.8}$$

a relation satisfied by many important data distributions. We also
assume that $A$, $B$ and $C$ are common to all component problems, and
that every $m_i \ne 1 - A$. Estimates of $\bar\lambda$ and $c_{x_i}$ can now be obtained by
the following steps: let

$$\bar{X}_{.} = \sum_{i=1}^{k} m_i \bar{X}_i \Big/ \sum_{i=1}^{k} m_i,$$

$$S_{XX} = \sum_{i=1}^{k} m_i (\bar{X}_i - \bar{X}_{.})^2/(k - 1),$$

$$U_i = m_i (m_i + A - 1)^{-1} [(A - 1)\bar{X}_i^2 + B\bar{X}_i + C].$$

Then unbiased estimates of $\bar\lambda$, $\omega_x$ and $\sigma_\lambda^2$ are given by

$$\hat{\bar\lambda} = \bar{X}_{.},$$

$$\hat\omega_x = \sum_{i=1}^{k} U_i/k,$$

$$\hat\sigma_\lambda^2 = S_{XX} - \hat\omega_x \Big\{k^{-1} \sum_{i=1}^{k} m_i^{-1}\Big\}.$$

These unbiased estimates of $\omega_x$ and $\sigma_\lambda^2$ can be negative and the
truncated estimates defined as follows can be used instead:

$$\tilde\sigma_\lambda^2 = \max(\hat\sigma_\lambda^2, 1/k), \qquad \tilde\omega_x = \max(\hat\omega_x, 1/k).$$

The quantity $c_{x_i}$ is then estimated by

$$\hat{c}_{x_i} = \tilde\omega_x/(m_i \tilde\sigma_\lambda^2 + \tilde\omega_x),$$

while the optimal non-compound estimator is estimated according to

$$\hat\lambda_i = \bar{X}_{.} + (1 - \hat{c}_{x_i})(\bar{X}_i - \bar{X}_{.}). \tag{7.4.9}$$

This is now a symmetric compound estimator and it is an obvious
analogue of the James-Stein estimator in a more general setting. The
estimates proposed above are modifications of those given by
Southward and van Ryzin (1972).


No structure assumed for $\sigma_i^2$

We now abandon the assumption that $\sigma_i^2$ has the special structure in
(7.4.8), but we do require that every $m_i \ge 2$. This approach has the
advantage that a knowledge of the parametric form of $F$ is not needed.
The results are therefore applicable to nonparametric cases where we
assume only that the means $\lambda_i$ and the variances $\sigma_i^2$ are finite. With $\hat\sigma_i^2$
the usual sample variance of the $m_i$ results in the $i$th component, the
estimates are

$$\tilde\omega_x = \max\Big\{\sum_{i=1}^{k} \hat\sigma_i^2/k,\; 1/k\Big\},$$

$$\tilde\sigma_\lambda^2 = \max\Big\{S_{XX} - \tilde\omega_x k^{-1} \sum_{i=1}^{k} m_i^{-1},\; 1/k\Big\},$$

to be used in (7.4.9).
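
The structure-free case lends itself to a short computational sketch, since it needs only component means and sample variances. The code below follows the pattern of (7.4.9) under simplifying assumptions that are specific to this illustration: the centre and the between-component variance are taken as unweighted moments of the component means, and the truncation at $1/k$ is applied to both variance estimates. The function name is hypothetical.

```python
from statistics import fmean, variance

def compound_linear_estimates(groups):
    """Compound estimates of component means in the spirit of (7.4.9)."""
    k = len(groups)
    m = [len(g) for g in groups]
    if min(m) < 2:
        raise ValueError("every component needs at least two observations")
    xbar = [fmean(g) for g in groups]
    centre = fmean(xbar)                           # estimate of lambda-bar
    omega = fmean(variance(g) for g in groups)     # estimate of omega_x
    s_xx = sum((xb - centre) ** 2 for xb in xbar) / (k - 1)
    sig_lam = s_xx - omega * fmean(1 / mi for mi in m)
    omega = max(omega, 1 / k)                      # truncated estimates
    sig_lam = max(sig_lam, 1 / k)
    c = [omega / (mi * sig_lam + omega) for mi in m]
    return [centre + (1 - ci) * (xb - centre) for ci, xb in zip(c, xbar)]
```

Each returned value depends on all components through `centre`, `omega` and `sig_lam`, which is what makes the rule a compound one.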


Example 7.4.1 The problem of linear calibration, discussed also in 
section 8.4.1, provides an interesting example of the application of 
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compound estimation. Suppose that a relation y = a + fix+ e holds, 
where y is a measurement by a ‘quick’ method, x is a measurement by 
a ‘slow’ but accurate method, and e is a random error with zero mean 
and variance a 2 . A calibration experiment yields measurements 
x 1 ,x 2 ,...,x„ by the slow method, and y\,yi->---,y n hy the quick 
method. The problem is to estimate the unknown current x when a 
measurement y by the quick method has been obtained. 

Consider linear estimators 4>{Y) = co 0 + oqY, where Y is the 
random variable of which y is a realization. Suppose that an estimator 
of x is chosen from the class (7.4.1) such that, if applied to the previous 
y,’s, it would result in the smallest overall mean square error. That is, 
choose cu 0 and a) 1 so as to minimize 


i- 1 

where D refers to the joint distribution of Y = (Yj, Y 2 ,..., T„) for fixed 
x = (xj, x 2 ,..., x„). This is a sensible criterion because in practice one 
would generally be concerned with x values falling in the range of 
values covered by the calibration experiment. Also, the estimator that 
is best for the known x’s is certainly a reasonable candidate as an 
estimator for the current x. 

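According to the criterion above the optimal linear estimator of $x$ is 

$\hat{x}(Y) = \bar{x} + \dfrac{\beta S_{xx}}{\{n/(n-1)\}\sigma^2 + \beta^2 S_{xx}}\,(Y - \alpha - \beta\bar{x}),$

where $S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2/(n-1)$. Defining $S_{YY}$ and $S_{xY}$ similarly we 
have 

$E_D(\bar{Y}) = \alpha + \beta\bar{x},$
$E_D S_{xY} = \beta S_{xx},$
$E_D S_{YY} = \sigma^2 + \beta^2 S_{xx}.$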

From these relations estimates of the parameters $\alpha$, $\beta$, $\sigma^2$ can be 
obtained by the method of moments. Replacing these parameters by 
their estimates in $\hat{x}(Y)$, and $n/(n-1)$ by 1, yields the ‘inverse estimator’ 

$\hat{x}_I(Y) = \bar{x} + (S_{xY}/S_{YY})(Y - \bar{Y})$

proposed by Krutchkoff (1967) as an alternative to the ‘classical’ 
estimator 

$\hat{x}_C(Y) = \bar{x} + (S_{xx}/S_{xY})(Y - \bar{Y}).$

The compound mean square error criterion provides a justification 
for the ‘inverse’ estimator. A more detailed discussion of the 
calibration controversy is given in Lwin and Maritz (1982). See also 
section 8.4.1 where an EB approach to the same problem is 
illustrated. 
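The two calibration estimators of Example 7.4.1 can be compared numerically. The following is a minimal sketch, not taken from the book: the function name `calibration_estimators`, the simulated calibration line and the noise level are all illustrative assumptions.

```python
import random

def calibration_estimators(x, y):
    """Build the 'inverse' and 'classical' estimators of x from calibration data."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Sample (co)variances with divisor n - 1, as in the definition of S_xx.
    s_xx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    s_yy = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

    def inverse(y0):
        # Inverse estimator: regression of x on y.
        return xbar + (s_xy / s_yy) * (y0 - ybar)

    def classical(y0):
        # Classical estimator: regression of y on x, then inverted.
        return xbar + (s_xx / s_xy) * (y0 - ybar)

    return inverse, classical

# Hypothetical calibration experiment (all values are assumptions).
rng = random.Random(1)
alpha, beta, sigma = 2.0, 0.5, 0.3
x = [float(i) for i in range(1, 21)]
y = [alpha + beta * xi + rng.gauss(0.0, sigma) for xi in x]

inverse, classical = calibration_estimators(x, y)
y0 = alpha + beta * 10.0          # a new quick-method reading
print(inverse(y0), classical(y0))
```

Because $S_{xY}^2 \le S_{xx}S_{YY}$, the inverse estimator never moves further from $\bar{x}$ than the classical one, which is the shrinkage behaviour expected of a compound rule.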


(d) Optimal non-linear estimators 

One may not wish to start with a class of estimators such as (7.4.6), 
especially if $\lambda_i$ is not the mean of the $i$th component distribution. If the 
parametric form of $F$ is known it is still possible to obtain an optimal, 
generally non-linear, estimator of $\lambda_i$ which minimizes the compound 
risk defined in (7.4.2). It is of the form 

$\dfrac{\sum_{u=1}^{k} \lambda_u \{\prod_{j=1}^{m_i} f(x_{ij} \mid \lambda_u)\}}{\sum_{u=1}^{k} \{\prod_{j=1}^{m_i} f(x_{ij} \mid \lambda_u)\}}.$    (7.4.10) 

The formula above is readily obtained from (1.3.2) by noting that $G$ 
may be replaced by the empirical c.d.f. of the $\lambda_i$, so that $W(\delta)$ is formally 
identical to the compound risk (7.4.1). The estimator in (7.4.10) again 
is not practically useful since it depends on the $\lambda_i$ values themselves, 
but its form does indicate how a non-linear compound estimator 
could be derived. Let $\hat\lambda_i$ be a standard non-compound estimate of $\lambda_i$ 
depending only on the observations in the $i$th component. Replacing 
every $\lambda_u$ by $\hat\lambda_u$ in (7.4.10) produces a compound non-linear estimator. The 
estimate of every $\lambda_i$ then becomes a weighted average of the 
conventional estimates, and so induces a shrinkage to an average, a 
feature of all compound estimators. 
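The weighted-average form of the plug-in version of (7.4.10) can be sketched as follows. This is an illustration only: it assumes a $N(\lambda, 1)$ kernel with a single observation per component, so that the conventional estimate of $\lambda_u$ is simply $x_u$; the function names are hypothetical.

```python
import math

def normal_pdf(x, mean):
    """Density of N(mean, 1) at x (the kernel assumed for this sketch)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def nonlinear_compound(x):
    """Plug-in version of (7.4.10): each estimate is a weighted average of
    the conventional estimates, with weights f(x_i | lambda_hat_u)."""
    estimates = []
    for xi in x:
        # Here x_u plays the role of the conventional estimate lambda_hat_u.
        weights = [normal_pdf(xi, lam_hat) for lam_hat in x]
        total = sum(weights)
        estimates.append(sum(w * lam_hat for w, lam_hat in zip(weights, x)) / total)
    return estimates

x = [-1.2, -0.4, 0.1, 0.3, 2.5]      # hypothetical observations
est = nonlinear_compound(x)
print(est)
```

Every estimate lies between the smallest and largest conventional estimate, and the extreme values are pulled inwards, which is the shrinkage described above.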


(e) Performance of compound estimators 

The applicability of compound estimators has been the source of 
much controversy since the appearance of the original James-Stein 
estimator. The justification of the compound estimators is in the 
compound loss. Thus, although the component problems may be 
unrelated, the use of only the information at the $i$th component to 
construct an estimate of $\lambda_i$ does not necessarily produce an optimal 
result in terms of the compound loss. 

A telling result is that, for the original James-Stein estimator, the 
compound risk is 

$R^{(k)}(\hat\lambda^{+}, \lambda) = \sigma^2[1 - E_D\{(k-3)^2/S\}],$    (7.4.11) 

where $S$ is a $\chi^2$ r.v. with $k - 1$ degrees of freedom. This $R^{(k)}(\hat\lambda^{+}, \lambda)$ is 
smaller than the corresponding risk of the best non-compound 
estimator, i.e. the ML estimator. Much work has been done extending 
the James-Stein estimators to more general situations. The case of the 
exponential family of data distributions has been treated by Hudson 
(1978); an outline of this work follows. 

Let $X_{ij}$ of the $i$th component have a p.d.f. belonging to the 
exponential family (3.4.5). Then the sample information of the $i$th 
component is summarized in a sufficient statistic $T_i$ also having a p.d.f. 
in the exponential family, say, 

fim = exp {Xf - (7.4.12) 

Suppose that $\theta_i = E_F(T_i)$, $i = 1, 2, \ldots, k$, are the parameters of interest 
to be estimated. Consider a subclass of (7.4.12) such that 

$E_F\{(T_i - \theta_i)\,l(T_i)\} = E_F\{a(T_i)\,l'(T_i)\}$    (7.4.13) 

for some function $a(T_i)$ and for all continuous functions $l(\cdot)$ such that 
$E|a(T_i)\,l'(T_i)| < \infty$. Hudson (1978) showed that if the p.d.f. of $T_i$ is 
given by 

$f(t \mid \lambda_i) = \exp\{\theta_i \int b(t)\,dt - \kappa(\lambda_i)\}\, b(t) \exp\{-\int t\,b(t)\,dt\},$

where $b(t) = 1/a(t)$, then it belongs to the subclass defined by (7.4.13). 
Let $S = \sum_{i=1}^{k} \{\int b(t_i)\,dt_i\}^2$ and define the compound estimator of $\theta_i$ as 

$\hat\theta_i^{+} = T_i - (k-2)\{\int b(T_i)\,dT_i\}/S.$    (7.4.14) 

Then the compound risk of $\hat\theta^{+}$ under squared error losses is 

$R^{(k)}(\hat\theta^{+}, \theta) = R^{(k)}(T, \theta) - E\{(k-2)^2/S\},$    (7.4.15) 

where $R^{(k)}(T, \theta)$ is the compound risk of $T$ as an estimator of $\theta$. The 
significance of result (7.4.15) is that the compound risk of $\hat\theta^{+}$ is always 
smaller than the compound risk of $T$, and it is an analogue of the 
James-Stein estimator for the exponential family of distributions. 


(f) Performance of linear compound estimators 
We now give an assessment of the linear compound estimator 
discussed in section 7.4.1(b). In doing so we modify the earlier results 
in two ways. First we use a generic non-compound estimator $T_i$ in 
place of $\bar{x}_i$ and assume that $E_F(T_i) = \lambda_i$, $\mathrm{var}_F(T_i) = \tau_i$. Second we take 
as our starting point the class of linear estimators 

$\hat\lambda_i = a_0 + a_1 T_i.$    (7.4.16) 

Note that $T_i$, and therefore $\hat\lambda_i$, need not be linear in the original 
observations. If $T_i = \bar{x}_i$ we are back to the linear case treated in 
section 7.4.1(b). 

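Following the arguments in section 7.4.1(b) the optimal linear 
estimator of $\lambda_i$, in terms of $T_i$, is 

$\lambda_i^* = \bar\lambda + (1 - d_1)(T_i - \bar\lambda),$

where $d_1 = \bar\tau/(\sigma_\lambda^2 + \bar\tau)$, $\bar\lambda = \sum_{i=1}^{k}\lambda_i/k$, $\bar\tau = \sum_{i=1}^{k}\tau_i/k$ and 
$\sigma_\lambda^2 = \sum_{i=1}^{k}(\lambda_i - \bar\lambda)^2/k$. Now suppose that $\hat\tau_i$ is any consistent estimate 
of $\tau_i$ from the data of the $i$th component. Then estimates of the 
elements of $d_1$ can be obtained as follows: 

$\hat{\bar\lambda} = \bar{T} = \sum_{i=1}^{k} T_i/k, \qquad \hat{\bar\tau} = \sum_{i=1}^{k}\hat\tau_i/k,$

and 

$\hat\sigma_\lambda^2 = \max\{\sum_{i=1}^{k}(T_i - \bar{T})^2/k - (k-1)\hat{\bar\tau}/k,\; 1/k\}.$

Replacing $\bar\lambda$, $\tau_i$, $\sigma_\lambda^2$, $\bar\tau$ by their estimates in the formulae for $d_1$ and $\lambda_i^*$ 
yields a compound estimator of $\lambda_i$. 

It can be shown that, when $\min_i m_i$ is large and 
$\tau_i = O(1/m_i)$, then $\lambda^*$ is preferred to $T$ in terms of compound risk if 
the following inequality holds: 

$\bar\tau > \sum_{i=1}^{k}(\lambda_i - \bar\lambda)^2/k.$    (7.4.17) 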

i = 1 

This is an important result since it gives a guide as to the practical 
usefulness of compound estimation. It should be noted that, even in 
terms of compound risk, the James-Stein type estimators do not 
dominate the non-compound estimators uniformly in general. The 
uniform dominance in the exponential family seems an exception 
rather than the rule. The result (7.4.17) reflects the more common 
situation that the compound estimators dominate the best non-compound 
estimators only in a subset of the parameter space. That 
subset is where the spread of the $\lambda_i$ values is smaller than the average 
spread of the data about their central values. 
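A minimal sketch of the estimated linear compound estimator follows. It assumes the moment estimate $\hat\sigma_\lambda^2 = \max\{\sum_i (T_i - \bar{T})^2/k - (k-1)\hat{\bar\tau}/k,\ 1/k\}$, which follows from $E\sum_i (T_i - \bar{T})^2 = k\sigma_\lambda^2 + (k-1)\bar\tau$; the function name and the data are hypothetical.

```python
def linear_compound(t, tau_hat):
    """Shrink component estimates t[i] towards their mean.

    t       : non-compound estimates T_i
    tau_hat : estimated variances of the T_i
    Returns (compound estimates, estimated shrinkage factor d1).
    """
    k = len(t)
    t_bar = sum(t) / k                      # estimate of lambda-bar
    tau_bar = sum(tau_hat) / k              # estimate of tau-bar
    spread = sum((ti - t_bar) ** 2 for ti in t) / k
    # Assumed moment estimate of sigma_lambda^2, floored at 1/k.
    sigma2 = max(spread - (k - 1) * tau_bar / k, 1.0 / k)
    d1 = tau_bar / (sigma2 + tau_bar)
    estimates = [t_bar + (1.0 - d1) * (ti - t_bar) for ti in t]
    return estimates, d1

t = [1.0, 2.0, 3.0, 4.0, 5.0]               # hypothetical T_i values
tau_hat = [0.5] * 5                         # hypothetical variance estimates
estimates, d1 = linear_compound(t, tau_hat)
print(d1, estimates)
```

When $\hat{\bar\tau}$ is large relative to the observed spread of the $T_i$, $d_1$ approaches 1 and all estimates collapse towards $\bar{T}$, in line with the condition (7.4.17).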

This last property is related to the roles played by the prior variance 
and the data variance in EB methods. In general EB estimates can 
only be better than conventional ones if the prior variance is smaller 
than the data variance. In this context where the dispersion of the $\lambda_i$ 
values is regarded as being brought about by a random mechanism, it 
also seems necessary to assume a physical connectedness between the 
component problems. In the compound decision context it does not 
seem to make practical sense to use a compound loss criterion unless 
there is some such connectedness. 

Example 7.4.2 This example is given by Hudson (1978). Let $X_i$ be a 
gamma r.v. with p.d.f. $f(x \mid \theta_i)$ given by 

$\ln f(x \mid \theta_i) = (\theta_i - 1)\ln x - x - \ln\Gamma(\theta_i), \quad i = 1, 2, \ldots, k, \; k \ge 3.$

Then $T_i = \sum_{j=1}^{m_i} X_{ij}$ has a gamma distribution of the same form as 
$f(x \mid \cdot)$ with parameter $\lambda_i = m_i\theta_i$. Let 

$\hat\lambda_i^{+} = T_i - (1/S)(k-2)\ln T_i,$

where $S = \sum_{i=1}^{k}(\ln T_i)^2$. Then the compound estimator $\hat\lambda^{+}$ dominates 
the non-compound unbiased estimator $T = (T_1, T_2, \ldots, T_k)$. 
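The dominance claimed in Example 7.4.2 can be checked by simulation. The sketch below is illustrative only: the parameter values, the number of replications and the function names are assumptions, and each $T_i$ is drawn directly as a gamma variate with shape $\lambda_i$ and unit scale.

```python
import math
import random

def hudson_gamma(t):
    """Compound estimator T_i - (k - 2) ln(T_i) / S with S = sum of (ln T_i)^2."""
    k = len(t)
    logs = [math.log(ti) for ti in t]
    s = sum(v * v for v in logs)
    return [ti - (k - 2) * v / s for ti, v in zip(t, logs)]

def compound_risks(lams, reps, rng):
    """Monte Carlo estimates of the total squared-error risk of T and of the
    compound estimator, for fixed parameter values lams."""
    loss_t = loss_plus = 0.0
    for _ in range(reps):
        t = [rng.gammavariate(lam, 1.0) for lam in lams]   # shape lam, scale 1
        plus = hudson_gamma(t)
        loss_t += sum((ti - lam) ** 2 for ti, lam in zip(t, lams))
        loss_plus += sum((pi - lam) ** 2 for pi, lam in zip(plus, lams))
    return loss_t / reps, loss_plus / reps

rng = random.Random(0)
lams = [2.0, 3.0, 4.0, 5.0, 6.0, 3.0, 4.0, 5.0]    # hypothetical lambda_i, k = 8
risk_t, risk_plus = compound_risks(lams, reps=20000, rng=rng)
print(risk_t, risk_plus)
```

The risk of $T$ should be close to $\sum_i \lambda_i$ (the sum of the gamma variances), and the simulated risk of the compound estimator should fall below it by roughly $(k-2)^2 E(1/S)$.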

7.4.2 Compound decisions between hypotheses 

Compound decision (CD) theory originated with Robbins (1951), 
who dealt in the first instance mainly with the problem of decision 
between two simple hypotheses. The particular problem treated in 
detail is that of choosing between $\lambda = +1$ and $\lambda = -1$ in a $N(\lambda, 1)$ 
distribution, when one observation $x_i$ is made at each component. In 
two more recent papers, Copas (1969, 1974) gives a critical review of 
CD theory with special emphasis on its connection with EB methods. 
The view presented is that, of the two approaches, CD seems to be the 
more flexible frame of reference in which to examine decision 
procedures. The following summary of CD theory is based in part on 
these two papers by Copas. 

(a) Decisions between two simple hypotheses 
Suppose that the value of $\lambda_i$ at the $i$th component is known to be either 
$\lambda^{(1)}$ or $\lambda^{(2)}$. We have to make a collection of decisions comprising the 
assigning of every unknown $\lambda_i$, which gave rise to observations $x_i$, to 
either $\lambda^{(1)}$ or $\lambda^{(2)}$. 

As a somewhat oversimplified example, consider each component 
experiment to be the random drawing, with replacement, of $m$ items 
from a batch of fixed size. The proportion of defectives in a batch is $\lambda$; 
suppose that we know every component $\lambda$ to be either $\lambda^{(1)}$ or $\lambda^{(2)}$ and 
that we are to ‘sentence’ batches accordingly. Every observed $x_i$, 
$i = 1, 2, \ldots, k$, is a realization of a $\mathrm{Bin}(m, \lambda)$ r.v. with $\lambda = \lambda^{(1)}$ or $\lambda = \lambda^{(2)}$. 
We emphasize that the sentencing is not done sequentially and that 
the entire sequence of batches is sentenced simultaneously when all 
sample results are to hand. Sequential sentencing is a different case 
and will be treated later. 

Since we make the $k$ decisions simultaneously we may regard 
the problem as one of selecting a parameter point $(\lambda'_1, \lambda'_2, \ldots, \lambda'_k)$ 
from a space $\Omega$ comprising the $2^k$ points $(\lambda^{(1)}, \lambda^{(1)}, \ldots, \lambda^{(1)}), \ldots,$ 
$(\lambda^{(2)}, \lambda^{(2)}, \ldots, \lambda^{(2)})$. One standard method of doing this is to maximize 
the likelihood $L(x, \lambda)$ of the observations $x = (x_1, x_2, \ldots, x_k)^T$ 
w.r.t. $\lambda$ in the space $\Omega$. Since 

$L(x, \lambda) = \prod_{i=1}^{k} f(x_i \mid \lambda_i),$    (7.4.18) 

it is maximized by maximizing every individual $f(x_i \mid \lambda_i)$. This is 
equivalent to comparing $f(x_i \mid \lambda^{(1)})$ and $f(x_i \mid \lambda^{(2)})$, i.e. using the 
likelihood ratio criterion 

$l_i = f(x_i \mid \lambda^{(1)})/f(x_i \mid \lambda^{(2)}), \quad i = 1, 2, \ldots, k.$    (7.4.19) 

This means that the decision rule is non-compound. The rule is also 
symmetric in the sense that every batch is sentenced according to the 
same criterion. 

One can recast the problem in the decision theoretic framework. 
For a typical component we partition the sample space into two 
regions $A_1$ and $A_2$ and our rule, $\delta(x_i)$, is to choose $\lambda^{(1)}$ when $x_i \in A_1$ and 
$\lambda^{(2)}$ otherwise. Here we take the loss to be $L(\delta, \lambda) = 0$ if the correct 
decision is made and $L(\delta, \lambda) = 1$ otherwise. For the entire sequence of 
decisions the total expected loss is then 

$E_D \sum_{i=1}^{k} L\{\delta(x_i), \lambda_i\}/k = E_D C(\delta)$

$\quad = (M_1/k)\int_{A_2} f(x \mid \lambda^{(1)})\,dx + (M_2/k)\int_{A_1} f(x \mid \lambda^{(2)})\,dx,$    (7.4.20) 

where $M_1$ and $M_2$ are the numbers of $\lambda_i$’s having the values $\lambda^{(1)}$ and 
$\lambda^{(2)}$ respectively. The symbol $E_D$ indicates expectation with respect to 
the joint distribution of $X_1, X_2, \ldots, X_k$ given $\lambda_1, \lambda_2, \ldots, \lambda_k$. 

Let $\theta = M_1/k$. Then (7.4.20) can be rewritten formally as the 
expression for $W(\delta)$ in section 1.4. Hence the optimal $A_1$ can be 
constructed using the argument of section 1.4 and the rule is readily 
found to be 

$\delta_k(x)$: accept $\lambda = \lambda^{(1)}$ if $x \le \xi_k$, 
accept $\lambda = \lambda^{(2)}$ otherwise,    (7.4.21) 

where $\xi_k$ is the solution of the following equation in $x$: 

$\theta f(x \mid \lambda^{(1)}) = (1 - \theta) f(x \mid \lambda^{(2)}).$

Clearly $\delta_k$ generally depends on $\theta$, and this means that the likelihood 
ratio method will not necessarily produce an optimal rule. 

In practice $\theta$ will generally be unknown, making determination of 
the optimal rule impossible. However, it is possible to obtain 
information about $\theta$ from the set of observations $x_i$, $i = 1, 2, \ldots, k$. 
Suppose that $\theta$ is estimated by $\hat\theta$ and that $\theta$ is replaced by $\hat\theta$ in the 
construction of the optimal decision rule. If $\hat\theta$ is close to $\theta$ it is not 
unreasonable to expect that the compound risk resulting from such a 
rule might be close to the smallest risk. A rule of the kind under 
discussion will generally be a compound rule. The question of 
closeness of the risk of the CD rule to the optimal risk will be taken up 
in section 7.4.2(c). 

We conclude this section by looking at an example studied in detail 
by Robbins (1951). Let $f(x \mid \lambda)$ be the $N(\lambda, 1)$ density. Then straightforward 
calculations yield 

$\xi_k = (\lambda^{(1)} + \lambda^{(2)})/2 - (\lambda^{(2)} - \lambda^{(1)})^{-1}\ln\{(1-\theta)/\theta\},$

where it is assumed that $\lambda^{(2)} > \lambda^{(1)}$. Now let $\bar{X} = (X_1 + X_2 + \cdots + X_k)/k$. 
Then $E(\bar{X}) = \theta\lambda^{(1)} + (1-\theta)\lambda^{(2)}$, and $\mathrm{var}(\bar{X}) = 1/k$. Hence a 
reasonable estimate of $\theta$ is $\hat\theta = (\bar{x} - \lambda^{(2)})/(\lambda^{(1)} - \lambda^{(2)})$, truncated to lie in 
the interval $[0, 1]$, and $\hat\xi_k$ is obtained by replacing $\theta$ by $\hat\theta$ in the formula 
defining $\xi_k$. By analogy to Example 1.4.1 the compound decision rule 
becomes 

if $\bar{x} \le \lambda^{(1)}$ choose $\lambda^{(1)}$ for all $x_i$, 
if $\bar{x} \ge \lambda^{(2)}$ choose $\lambda^{(2)}$ for all $x_i$, 
if $\lambda^{(1)} < \bar{x} < \lambda^{(2)}$ choose $\lambda^{(1)}$ if $x_i \le \hat\xi_k$, 
choose $\lambda^{(2)}$ if $x_i > \hat\xi_k$,    (7.4.22) 


which coincides with (5.2.10). 

Regarding the CD rule embodied in $\hat\xi_k$: for any given $\xi$ the total loss 
is 

$kC(\xi) = e_1^{(1)} + \cdots + e_{M_1}^{(1)} + e_1^{(2)} + \cdots + e_{M_2}^{(2)},$

where $e_i^{(1)} = 1$ or $0$ as $x_i^{(1)} > \xi$ or $\le \xi$, and $e_i^{(2)} = 1$ or $0$ as $x_i^{(2)} \le \xi$ or $> \xi$, 
$i = 1, 2, \ldots, M_j$, $j = 1, 2$, and the $x_i^{(j)}$ are observations generated by $\lambda^{(j)}$. Since 

$E_x(e_i^{(1)}) = P[x_i^{(1)} > \xi],$
$E_x(e_i^{(2)}) = P[x_i^{(2)} \le \xi],$    (7.4.23) 

and the $x_i^{(j)}$, $i = 1, 2, \ldots, M_j$, are identically distributed for each $j$, we 
have 

$E_x C(\xi) = (M_1/k)\,P[x^{(1)} > \xi] + (M_2/k)\,P[x^{(2)} \le \xi].$    (7.4.24) 

Without writing down a more explicit expression for the r.h.s. of 
(7.4.24) it is easy to see that, since $\hat\xi_k \to \xi_k$, in probability, as $k \to \infty$, 
$E_x C(\hat\xi_k) \to E_x C(\xi_k)$. Robbins (1951) has given an expression for 
$E_x C(\hat\xi_k)$ which can be evaluated by numerical integration. A Monte 
Carlo method of estimating $E_x C(\hat\xi_k)$ is easily developed. For a given 
set of $\lambda^{(1)}$ and $\lambda^{(2)}$ values (i.e. given $M_1$ and $M_2$), a sequence of 
observations, $x_1, x_2, \ldots, x_k$, is generated by sampling from appropriate 
normal distributions. Computation of $\hat\xi_k$ and $C(\hat\xi_k)$ is then 
straightforward. Repetition of this process, of x-sampling with the 
same $\lambda$’s, yields a sequence of $C(\hat\xi)$ values whose average is an estimate 
of $E_x C(\hat\xi)$. Table 7.1 gives the results of such calculations, in the case 
$\lambda^{(1)} = -1$, $\lambda^{(2)} = +1$; the results for $k = 100$ are those given by 
Robbins (1951). 

Table 7.1 Compound decisions between $H_1$: $\lambda = -1$, $H_2$: $\lambda = +1$ in the $N(\lambda, 1)$ case. 
Expected proportions of wrong decisions are tabulated for $k = 100$, $k = 10$, when 
$\theta_1 = P(\lambda = -1) = 0.1, 0.2, 0.3, 0.4, 0.5$ [Robbins (1951)] 

$\theta_1$      $E_x C(\hat\xi)$, $k = 100$      $E_x C(\hat\xi)$, $k = 10$ 
0.1      0.076      0.11 ± 0.01 
0.2      0.117      0.15 ± 0.01 
0.3      0.144      0.22 ± 0.02 
0.4      0.159      0.21 ± 0.02 
0.5      0.163      0.20 ± 0.02 
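The Monte Carlo procedure described above can be sketched as follows for $\lambda^{(1)} = -1$, $\lambda^{(2)} = +1$. This is an illustrative implementation under stated assumptions, not Robbins's own computation: the function names, the seed and the number of replications are arbitrary, and the cut-off is taken as $(\lambda^{(1)} + \lambda^{(2)})/2 - \ln\{(1-\hat\theta)/\hat\theta\}/(\lambda^{(2)} - \lambda^{(1)})$.

```python
import math
import random

LAM1, LAM2 = -1.0, 1.0     # the two simple hypotheses

def compound_cutoff(x):
    """Estimated cut-off: theta is estimated from the sample mean and
    truncated to [0, 1]; the end points give 'all LAM2' or 'all LAM1'."""
    xbar = sum(x) / len(x)
    theta_hat = min(1.0, max(0.0, (xbar - LAM2) / (LAM1 - LAM2)))
    if theta_hat == 0.0:
        return -math.inf           # every x_i is assigned to LAM2
    if theta_hat == 1.0:
        return math.inf            # every x_i is assigned to LAM1
    return 0.5 * (LAM1 + LAM2) - math.log((1.0 - theta_hat) / theta_hat) / (LAM2 - LAM1)

def proportion_wrong(x, lams, cutoff):
    """C(cutoff): proportion of wrong decisions for one set of observations."""
    wrong = 0
    for xi, lam in zip(x, lams):
        decision = LAM1 if xi <= cutoff else LAM2
        wrong += decision != lam
    return wrong / len(x)

def monte_carlo(theta, k, reps, rng):
    """Average C over repeated x-sampling with the lambda values held fixed."""
    m1 = round(theta * k)
    lams = [LAM1] * m1 + [LAM2] * (k - m1)
    total = 0.0
    for _ in range(reps):
        x = [rng.gauss(lam, 1.0) for lam in lams]
        total += proportion_wrong(x, lams, compound_cutoff(x))
    return total / reps

rng = random.Random(0)
est = monte_carlo(theta=0.1, k=100, reps=2000, rng=rng)
print(est)
```

For $\theta_1 = 0.1$ and $k = 100$ the estimate should be of the same order as the corresponding entry of Table 7.1, and well below the error rate of about 0.159 incurred by the non-compound rule with cut-off 0.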

The example we have just considered is a rather simple one of the 
type involving two simple hypotheses, because the normal distribution 
belongs to the ‘monotone likelihood ratio’ family. In other 
cases $A_1$ and $A_2$ may be more complicated, but the principles remain 
the same. 


(b) Two composite hypotheses 


Introduction 

Instead of restricting the values of the parameter to only two possible 
numbers, $\lambda^{(1)}$ and $\lambda^{(2)}$, we now allow the $k$ parameter values 
$\lambda_1, \lambda_2, \ldots, \lambda_k$ to be an arbitrary set of numbers. Two composite 
hypotheses are represented by a partition of the $\lambda$-space into $\omega_1$ and 
$\omega_2$. Our task is to assign every $\lambda_j$ to $H_1$: $\lambda \in \omega_1$ or $H_2$: $\lambda \in \omega_2$, on the 
basis of $x_j$, and allowing the possibility that every decision may also 
be influenced by the observations other than $x_j$ alone. 

Following the approach of section 7.4.2(a) we examine the consequences 
of employing a simple rule, $\delta(x)$, by which we select 

$H_1$ when $x_j \in A_1$ 

and 

$H_2$ when $x_j \in A_2$ 

for $j = 1, 2, \ldots, k$; $A_1$ and $A_2$ remaining fixed. Let $L(\delta(x), \lambda)$ denote the 
loss on making decision $\delta(x)$ when the parameter value is $\lambda$. Then the 
total loss on making the decisions for the entire set of observations is 

$kC(\delta) = L[\delta(x_1), \lambda_1] + \cdots + L[\delta(x_k), \lambda_k],$

and the expected average loss in repeated x-sampling with fixed 
$\lambda_1, \lambda_2, \ldots, \lambda_k$ is 

$E_x C(\delta) = \dfrac{1}{k}\sum_{j=1}^{k}\int L[\delta(x_j), \lambda_j]\,dF(x_j \mid \lambda_j) = \int\!\!\int L[\delta(x), \lambda]\,dF(x \mid \lambda)\,dG_k(\lambda).$    (7.4.25) 

In the expression (7.4.25), $G_k(\lambda)$ is a d.f. with jumps of magnitude $1/k$ at 
the points $\lambda_{(1)}, \ldots, \lambda_{(k)}$, these being the parameter values $\lambda_1, \ldots, \lambda_k$ 
arranged in order of magnitude. 

Our aim is to determine $\delta(x)$ such that $E_x C(\delta)$ is minimized, and 
clearly we should be able to do this by the established Bayesian 
techniques if $G_k(\lambda)$ were known. Knowledge of $G_k(\lambda)$ in this case is 
equivalent to knowledge of $M_1$ and $M_2$ in the case of two simple 
hypotheses, treated in section 7.4.2(a). In the more general case it is 
tantamount to being given the set of values $\lambda_1, \ldots, \lambda_k$ without being 
told with which $x_j$ any given $\lambda_j$ is associated. In practice such 
knowledge is usually not forthcoming; we have seen in the simpler 
version of this problem that $M_1$ and $M_2$ are generally unknown. 
However, it also appeared that $M_1$ and $M_2$ could be estimated using 
the observed x-values, and the question now is whether the same 
possibility exists in the more general case, and how it may be 
exploited. 

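Define 

$e_i = 1$ when $x_i \le x$, 
$e_i = 0$ otherwise, 

where $x$ is an arbitrary x-value. Then 

$P(e_i = 1) = P(x_i \le x) = \int_{-\infty}^{x} dF(x_i \mid \lambda_i).$

Putting 

$F_k(x) = \sum_{i=1}^{k} e_i/k$ = (the number of x-values $\le x$)/$k$, 

we see that 

$E\{F_k(x)\} = \dfrac{1}{k}\sum_{i=1}^{k}\int_{-\infty}^{x} dF(x_i \mid \lambda_i) = \int\!\!\int_{-\infty}^{x} dF(x_i \mid \lambda)\,dG_k(\lambda) = F_{G_k}(x).$

Remembering that $\mathrm{var}(F_k(x)) \le 1/(4k)$, we see that $F_k(x)$ is, for every $x$, 
an unbiased estimate of $F_{G_k}(x)$. 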

We emphasize that $F_k(x)$ and $F_{G_k}(x)$ and the relationship between 
them are obtained by a somewhat different argument from that 
occurring in the relationship between $F_n(x)$ and $F_G(x)$ in the EB 
problem. Nevertheless, $F_k(x)$, $F_{G_k}(x)$ and $G_k(\lambda)$ have the mathematical 
properties of d.f.s, and the connection between them is directly 
comparable with the relationship between $(F_n, F_G, G)$ in the EB case. 
An immediate consequence is that the procedure developed for 
estimating $G$ in the EB case can be applied here. There is this 
difference: the $\lambda_1, \ldots, \lambda_k$ are not necessarily regarded as originating 
from a certain population of $\lambda$’s from which they are obtained by 
random sampling. Thus the V-smooth’ method would appear to be 
more appropriate than the ‘parametric smooth’ method. However, if 
the CD problem arises in circumstances where the $\lambda$’s can be regarded 
as random observations from a population whose d.f. belongs to a 
certain parametric family, then a member of that family may possibly 
be used advantageously as an approximation to G. This possibility 
has not been explored further. In the following section we give the 
results of a study of the problem of decision between two composite 
hypotheses. 

A particular case using the ‘0-1’ loss structure 

Let the two hypotheses be $H_1$: $\lambda \le \lambda_0$ or $H_2$: $\lambda > \lambda_0$, the loss being 1 
when the wrong decision is made, and 0 otherwise. The Bayes solution 
to this problem is outlined in section 1.4. When 

$R(x) = \int_{-\infty}^{\lambda_0} f(x \mid \lambda)\,dG_k(\lambda) \Big/ \int_{\lambda_0}^{\infty} f(x \mid \lambda)\,dG_k(\lambda)$

is monotonic in $x$, the regions of acceptance for $H_1$ and $H_2$ become 
$x \in [-\infty, \xi_{G_k})$ and $x \in [\xi_{G_k}, +\infty]$, where $\xi_{G_k}$ is a Bayes ‘cut-off’ for $x$. 
Thus, if $x < \xi_{G_k}$ accept $H_1$ and if $x \ge \xi_{G_k}$ accept $H_2$. The value of $\xi_{G_k}$ is 
determined by finding $x$ such that 

$\int_{-\infty}^{\lambda_0} f(x \mid \lambda)\,dG_k(\lambda) = \int_{\lambda_0}^{\infty} f(x \mid \lambda)\,dG_k(\lambda).$    (7.4.26) 

Such an $x$ can be determined if the l.h.s. and r.h.s. are continuous 
functions of $x$, and $x$ takes all values. When $X$ is a discrete r.v., a 
suitable convention must be adopted. 
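When $G_k$ is known, the cut-off defined by (7.4.26) can be found numerically. The sketch below assumes the $N(\lambda, 1)$ kernel, for which the log-ratio of the two sides of (7.4.26) is strictly increasing in $x$, so bisection applies; the function names and the bracket-expansion scheme are illustrative choices.

```python
import math

def log_sum_density(x, lams):
    """log of sum_j exp(-(x - lam_j)^2 / 2), computed stably (log-sum-exp).
    The common factor 1/sqrt(2*pi) cancels in (7.4.26) and is omitted."""
    exps = [-0.5 * (x - lam) ** 2 for lam in lams]
    m = max(exps)
    return m + math.log(sum(math.exp(e - m) for e in exps))

def bayes_cutoff(lams, lam0, tol=1e-10):
    """Solve (7.4.26) for x when G_k puts mass 1/k on each value in lams."""
    lower = [lam for lam in lams if lam <= lam0]    # values under H1
    upper = [lam for lam in lams if lam > lam0]     # values under H2
    if not lower or not upper:
        raise ValueError("each hypothesis needs at least one parameter value")

    def g(x):
        # Increasing in x for the normal kernel; zero at the cut-off.
        return log_sum_density(x, upper) - log_sum_density(x, lower)

    lo, hi = min(lams) - 1.0, max(lams) + 1.0
    while g(lo) > 0.0:                # widen the bracket to the left if needed
        lo -= 2.0 * (hi - lo)
    while g(hi) < 0.0:                # widen the bracket to the right if needed
        hi += 2.0 * (hi - lo)
    while hi - lo > tol:              # plain bisection
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(bayes_cutoff([-1.0, 1.0], 0.0))
print(bayes_cutoff([-1.0, -1.0, -1.0, 1.0], 0.0))
```

With three parameter values at $-1$ and one at $+1$ the cut-off moves to the right of zero, enlarging the acceptance region of $H_1$; this agrees with the two-point formula for $\xi_k$ with $\theta = 3/4$.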

When $G$ is not known, but is approximated by $G^*$ (see section 2.9), 
we obtain an estimate of $G^*$ by maximizing 

$(1/k)\sum_{i=1}^{k}\log\{f_{G^*}(x_i)\}.$

The motivation for this procedure is exactly the same as before, 
namely minimization of a ‘distance’ between $F_k(x)$ and $F_{G^*}(x)$; the 
reader is referred to section 2.9. We denote the estimated $G^*$ by $\hat{G}^*$, 
and on substituting $\hat{G}^*$ for $G_k$ in (7.4.25) a ‘compound’ cut-off, $\hat\xi$, is 
obtained. Thus the CD rule is 

accept $H_1$ if $x_i < \hat\xi$, accept $H_2$ if $x_i \ge \hat\xi$, $i = 1, 2, \ldots, k$. 

When $f(x \mid \lambda)$ is such as not to yield a monotonic $R(x)$, the process of 
finding the Bayes and CD rules may be somewhat more cumbersome, 
but it will not be different in principle. The approximation of $G$ 
by $G^*$ will be carried out in the same way, and $H_1$ or $H_2$ is then chosen 
according to (7.4.25), with $G$ replaced by $\hat{G}^*$. 

Example 7.4.3 Let $f(x \mid \lambda)$ be the $N(\lambda, 1)$ density. We have seen in 
section 1.5 that $R(x)$ is monotonic in $x$ for this case. Determination of 
the compound decision rules follows the methods of Chapter 4, and 
no further details are needed here. Numerical results relating to this 
example are given in section 7.4.7. 


7.4.4 Risk convergence 

The risk convergence criterion was introduced for compound decisions 
as an analogue of asymptotic optimality in EB theory; see 
Robbins (1951), Samuel (1965). For a typical CD rule, $\delta$, the $i$th stage 
decision is $\delta_i(x)$ with the corresponding average compound risk 
$R^{(k)}\{\delta^{(k)}, \lambda\}$ as defined in (7.4.2) with $\delta^{(k)} = (\delta_1, \delta_2, \ldots, \delta_k)$ in the place 
of $\hat\lambda$. Suppose that the parameter space of every component $\lambda_i$ is $\Omega$. 
Then the parameter space of $\lambda$ is $\Omega^{(k)}$, the $k$-fold Cartesian product of 
$\Omega$. Let $G_k$ be a distribution over $\Omega$ which assigns probability $s/k$ to 
$\lambda \in \Omega$ when $\lambda_i = \lambda$ for $s$ values of $i$; $i = 1, 2, \ldots, k$; $s = 0, 1, \ldots, k$. Then 
for a typical component, the average risk is 

$R(\delta, \lambda) = \int E_F L\{\delta(x), \lambda\}\,dG_k(\lambda),$    (7.4.27) 

where $F$ is the d.f. of $X$ for given $\lambda$. Thus the optimal estimator for the 
component problem, i.e. the one minimizing $R(\delta, \lambda)$, is the ‘Bayes’ decision $\delta_k$ 
with respect to the ‘prior’ $G_k$; let this minimum average 
component risk be $R(\delta_k)$. 

Now let the parameter space $\Omega^{(k)}$ be partitioned into equivalence 
classes by means of the empirical d.f. $G_k$, and let $Z(G_k)$ be the 
equivalence class of all $\lambda \in \Omega^{(k)}$ which have the specified $G_k$ as their 
empirical distribution. If $\delta_k$ is used at every component problem it 
incurs the average component risk $R(\delta_k)$ at every $\lambda \in Z(G_k)$. No simple 
rule $\delta^{(k)}$ exists which satisfies 

$\sup_{\lambda \in Z(G_k)} R^{(k)}(\delta^{(k)}, \lambda) \le R(\delta_k)$

unless $\delta_i$ is some version of $\delta_k$, in which case equality holds (Samuel, 
1965). This means effectively that non-compound rules cannot 
minimize the average compound risk, thus justifying compound rules. 
Samuel (1965) also advanced that CD rules which are risk convergent 
should be sought. 

The risk convergence property: A compound decision rule is said to 
be risk convergent if for every infinite sequence $\lambda_0 = (\lambda_1, \lambda_2, \ldots)$ and 
every $\varepsilon > 0$ there exists an integer $N(\lambda_0, \varepsilon)$ such that for all $k > N(\lambda_0, \varepsilon)$ 

$R^{(k)}(\delta^{(k)}, \lambda^{(k)}) - R(\delta_k) < \varepsilon,$

where $\lambda^{(k)}$ is the initial $k$-vector of $\lambda_0$. Samuel (1965) showed that, 
under fairly general conditions, no non-compound rule exists which 
possesses the risk convergence property. 

There has been much work published on the rates of convergence of 
risk-convergent rules. The direct relevance of this type of study to 
practical applications is not obvious, and will not be dealt with in any 
detail here. 


7.4.5 Decisions between $q \ge 2$ hypotheses 

In this section we consider a more general formulation of the 
compound decision problem. The r.v. $X_i$ has p.d.f. $f(x \mid \lambda_i)$ depending 
on the parameter $\lambda_i$, $i = 1, 2, \ldots, k$. Each component parameter $\lambda_i$ 
belongs to one of the regions $\Omega_1, \Omega_2, \ldots, \Omega_q$ which constitute a 
partition of the common parameter space $\Omega$ of the $\lambda_i$. Thus we have a 
set of $q$ composite hypotheses $H_r$: $\lambda_i \in \Omega_r$, $r = 1, 2, \ldots, q$, for each $\lambda_i$. On 
the basis of the observed $x_i$ the unknown $\lambda_i$ must be assigned to one of 
$\Omega_1, \Omega_2, \ldots, \Omega_q$, i.e. one of $H_1, H_2, \ldots, H_q$ must be chosen. 

Let $\delta(x)$ be a non-compound decision rule which partitions the 
sample space of $X$ into regions $A_1, A_2, \ldots, A_q$ such that $H_j$ is chosen 
when $x \in A_j$. The loss incurred by using $\delta(x)$ when the parameter value 
is $\lambda$ is $L\{\delta(x), \lambda\}$. The expected loss in repeated x-sampling is the 
average compound risk, 

$R^{(k)}(\delta, \lambda) = k^{-1}E_D\sum_{i=1}^{k} L\{\delta(X_i), \lambda_i\} = \int E_F L\{\delta(X), \lambda\}\,dG_k(\lambda),$

where, as in section 7.4.2(b), $G_k$ is the empirical d.f. of the $\lambda_i$. To obtain 
the optimal non-compound CD rule we find the ‘Bayes’ rule with 
‘prior’ $G_k$. Formally this is the same as the EB problem with $G_k$ in 
the place of $G$. 

As in other CD problems, the key to constructing applicable CD 
rules is a satisfactory method of estimating G k . Once such an estimate 
has been found it can be substituted for G k to derive a CD rule. Risk 
convergence and rates of convergence have been discussed by van 
Ryzin (1966a) in this setting, generalizing work on two simple 
hypotheses by Hannan and Robbins (1955) and Hannan and van 
Ryzin (1965). 
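For a known $G_k$ the ‘Bayes’ rule for $q$ hypotheses is simple to compute. The sketch below assumes 0-1 loss, a $N(\lambda, 1)$ kernel and interval-shaped regions defined by sorted boundary points; these choices, and the function name `bayes_region`, are illustrative assumptions. Under 0-1 loss the rule picks the region carrying the largest posterior mass.

```python
import math
from bisect import bisect_left

def bayes_region(x, lams, boundaries):
    """Return the 0-based index r of the region chosen for observation x.

    lams       : the parameter values on which G_k puts mass 1/k each
    boundaries : sorted cut points b_1 < ... < b_{q-1}; region 0 is
                 lambda <= b_1, region r is b_r < lambda <= b_{r+1}, and the
                 last region is lambda > b_{q-1}
    """
    q = len(boundaries) + 1
    mass = [0.0] * q
    for lam in lams:
        r = bisect_left(boundaries, lam)            # region containing lam
        mass[r] += math.exp(-0.5 * (x - lam) ** 2)  # unnormalized posterior mass
    return max(range(q), key=mass.__getitem__)

lams = [-3.0, -3.0, 0.0, 0.0, 3.0, 3.0]     # hypothetical parameter values
boundaries = [-1.5, 1.5]                    # q = 3 regions
print([bayes_region(x, lams, boundaries) for x in (-3.0, 0.0, 3.0)])
```

In practice $G_k$ is unknown and an estimate would be substituted for `lams`, as the text goes on to describe.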


7.4.6 Compound decision rules; sequential case 


(a) Introduction 

Almost every CD rule that has been developed has a sequential 
counterpart. The work of Samuel (1965) dealing with the non-existence 
of non-compound rules that are risk convergent also covers 
the sequential case. One reason for this simultaneous development is 
the perceived practical potential of the sequential procedures. 

Let us imagine that, instead of all $k$ observations being in hand 
before the decisions regarding the $\lambda_i$ values have to be made, the 
results $x_1, x_2, \ldots$ are obtained sequentially. At every stage the decision 
$\delta(x_i)$ has to be taken immediately. We are still interested in 
minimizing the collective expected risk over all $k$ decisions. Denote by 
$\delta_k(x)$ the ‘Bayes’ rule for minimizing the expected loss when all 
decisions are made simultaneously. Suppose now that the rule $\delta_{k-1}(x)$ 
is used for the first $k - 1$ decisions, while $\delta_k(x)$ is employed for the $k$th 
one. The total expected loss is then 

$\sum_{j=1}^{k-1}\int L[\delta_{k-1}(x_j), \lambda_j]\,dF(x_j \mid \lambda_j) + \int L[\delta_k(x_k), \lambda_k]\,dF(x_k \mid \lambda_k)$

$\quad \le \sum_{j=1}^{k-1}\int L[\delta_k(x_j), \lambda_j]\,dF(x_j \mid \lambda_j) + \int L[\delta_k(x_k), \lambda_k]\,dF(x_k \mid \lambda_k)$

$\quad = kE_x C(\delta_k) \le kE_x C(T),$    (7.4.28) 

where $T$ is a conventional rule. Repeated application of (7.4.28) shows 
that if the new Bayes rule is used at every stage, the resulting overall 
expected loss will not exceed either $E_x C(\delta_k)$ or $E_x C(T)$. Observe that 
the symbol $C(\cdot)$ has the same meaning as before. Thus, denoting the 
sequentially adjusted Bayes rule by ‘seq. $\delta_k$’, equation (7.4.28) states 
that 

$E_x C(\text{seq. } \delta_k) \le E_x C(\delta_k) \le E_x C(T).$

In practice, implementation of a sequential Bayes rule may usually 
be regarded as an impossibility, but the result given by (7.4.28) has 
significant implications for the CD problem. They may be stated as 
follows: if $\hat\delta_k$ is a non-sequential CD rule such that $\hat\delta_k \to \delta_k$, in 
probability, as $k \to \infty$, then, for $k$ sufficiently large, $E_x L[\hat\delta_k(x_j), \lambda_j]$ will 
be close to $E_x L[\delta_k(x_j), \lambda_j]$. Now suppose we use the rule $T$ for all 
decisions when $j \le n_0$ and the CD rule $\hat\delta_j$ for $j = n_0 + 1, n_0 + 2, \ldots, k$. 
Then if $n_0$ and $k$ are large enough, with $n_0/k$ decreasing as $k$ increases, 
the effect on the overall expected loss of using $T$ for the first $n_0$ 
decisions will become negligible as $k$ increases, while the expected loss 
for the last $k - n_0$ decisions will be close to the ‘sequential Bayes’ 
expected loss for these decisions. The combined effect will be to 
produce an overall expected loss not exceeding $E_x C(T)$. 

(b) Two simple hypotheses 

We consider the case of two simple hypotheses based on the $N(\lambda, 1)$ 
kernel distribution with fixed values $\lambda^{(1)}$ and $\lambda^{(2)}$ of $\lambda$ representing the 
two hypotheses. Let the sequential parameter values be $\lambda_1^*, \lambda_2^*, \ldots$, 
where every $\lambda^*$ is either $\lambda^{(1)}$ or $\lambda^{(2)}$. In this case the rule $\delta_k$ refers to the 
Bayes cut-off $\xi_k$; individual losses are $e_i^{(1)}$ and $e_i^{(2)}$ as defined above. 

Let $m_1$ denote the number of $\lambda^{(1)}$ values among the first $n$ values 
$\lambda_j^*$, $j = 1, 2, \ldots, n$. For any $\theta_1 = m_1/n$ and arbitrary small $\varepsilon, \varepsilon' > 0$ we 
know that 

$|\bar{x} - \theta_1\lambda^{(1)} - \theta_2\lambda^{(2)}| < \varepsilon$ with probability $(1 - \varepsilon')$, 

for $n \ge n(\varepsilon, \varepsilon', \theta_1)$. Hence, noting that the expected losses are given by 
(7.4.23), that the correlation between any $x_j$ and $\bar{x}$ tends to 0 as $n \to \infty$, and 
that the losses are bounded by 0 and 1, we have 

$|E_x L[\hat\delta_n(x_n), \lambda_n^*] - E_x L[\delta_n(x_n), \lambda_n^*]| < \varepsilon''(\varepsilon, \varepsilon', \theta_1)$

for $n \ge n(\varepsilon, \varepsilon', \theta_1)$, where $\varepsilon''(\varepsilon, \varepsilon', \theta_1) \to 0$ as $\varepsilon, \varepsilon' \to 0$. Putting 
$n_0 = \max_{\theta_1} n(\varepsilon, \varepsilon', \theta_1)$, we have 

$|E_x L[\hat\delta_n(x_n), \lambda_n^*] - E_x L[\delta_n(x_n), \lambda_n^*]| < \varepsilon''_0,$

uniformly for $n > n_0$. 

Now let the rule $T$ be used for the first $n_0$ of our $k$ decisions. Then 
the average expected loss over all $k$ decisions is at most 

$\sum_{j=1}^{n_0}\{E_x L[T, \lambda_j^*] - E_x L[\delta_j, \lambda_j^*]\}/k + E_x C(\text{seq. } \delta_k) + (k - n_0)\varepsilon''_0/k < E_x C(\delta_k),$ for $k$ sufficiently large. 

This also implies, of course, that $E_x C(\text{seq. } \delta_k) < E_x C(T)$, for $k$ large 
enough. 

Samuel (1963, 1964) has studied this case in detail, emphasizing 
the practically very important fact that the order in which the $\lambda$ 
values occur does not affect the conclusion that the sequential 
CD rule will be ‘better’ than a conventional rule for large $k$. Some 
numerical results relating to the performance of the sequential CD 
rule are given by Samuel (1964). 
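A sequential CD rule of the kind discussed here can be sketched as follows for $\lambda^{(1)} = -1$, $\lambda^{(2)} = +1$. This is an illustration under stated assumptions, not the procedure studied by Samuel: the conventional rule $T$ is taken to be the midpoint cut-off, the estimate of $\theta$ at stage $j$ uses the mean of $x_1, \ldots, x_j$, and the values of $k$, $n_0$ and the number of replications are arbitrary.

```python
import math
import random

LAM1, LAM2 = -1.0, 1.0               # the two simple hypotheses
MIDPOINT = 0.5 * (LAM1 + LAM2)       # cut-off of the conventional rule T

def estimated_cutoff(xbar):
    """Cut-off with theta replaced by its truncated moment estimate."""
    theta_hat = min(1.0, max(0.0, (xbar - LAM2) / (LAM1 - LAM2)))
    if theta_hat == 0.0:
        return -math.inf
    if theta_hat == 1.0:
        return math.inf
    return MIDPOINT - math.log((1.0 - theta_hat) / theta_hat) / (LAM2 - LAM1)

def sequential_cd(x, n0):
    """Decide on each x_j as it arrives: rule T for j <= n0, then the CD
    cut-off based on the running mean of the observations seen so far."""
    decisions, running_sum = [], 0.0
    for j, xj in enumerate(x, start=1):
        running_sum += xj
        cutoff = MIDPOINT if j <= n0 else estimated_cutoff(running_sum / j)
        decisions.append(LAM1 if xj <= cutoff else LAM2)
    return decisions

def average_loss(theta, k, n0, reps, rng):
    """Monte Carlo estimate of the expected proportion of wrong decisions."""
    m1 = round(theta * k)
    total = 0.0
    for _ in range(reps):
        lams = [LAM1] * m1 + [LAM2] * (k - m1)
        rng.shuffle(lams)            # the order of the lambda values is arbitrary
        x = [rng.gauss(lam, 1.0) for lam in lams]
        d = sequential_cd(x, n0)
        total += sum(di != lam for di, lam in zip(d, lams)) / k
    return total / reps

rng = random.Random(0)
seq_loss = average_loss(theta=0.1, k=200, n0=20, reps=500, rng=rng)
conventional_loss = 0.5 * math.erfc(1.0 / math.sqrt(2.0))   # P(N(0,1) > 1)
print(seq_loss, conventional_loss)
```

With $\theta = 0.1$ the sequential rule should show a clearly smaller average loss than the conventional rule, whose error rate is about 0.159 whatever the value of $\theta$.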

(c) A finite number of simple hypotheses, composite hypotheses 
Since the formulation in section 7.4.6(a) is not restricted to two simple 
hypotheses, it is clear that the ideas of section 7.4.6(b) can be extended 
to the case of a finite number of simple hypotheses. Essentially the 
problem revolves about the determination of consistent estimates of 
the proportions in which the different parameter values occur, i.e. of 
G k (Samuel, 1966). 

In the case of the composite hypotheses where the parameter values 
are not restricted to a finite number of given points, the arguments 
applicable to the case of a finite number of hypotheses can still be 
used. If the parameter values are restricted to a finite interval an 
extension of the ‘finite’ argument, using standard limiting processes, 
can be used to show that a sequential CD rule will be better than the 
conventional rule for large k provided a sequential rule can be 
developed. Since the main question is again the estimation of $G_j$, 
$j = n_0 + 1, \ldots, k$, the idea of approximating $G_j$, developed for EB 
procedures, can be used (Maritz, 1968). Restriction of the parameter 
values to a finite interval means, in practice, that the sequential CD 
rule will be relatively ‘good’ if the ultimate spread of parameter values 
is small relative to $\mathrm{var}(x \mid \lambda)$. The next section gives some numerical 
data for a case of two composite hypotheses. 

7.4.7 Performance of CD rules in the case $f(x \mid \lambda) = N(\lambda, 1)$ 

(a) Nonsequential case 

Two series of examples have been prepared. Two finite populations of
$\lambda^{(i)}$, $i = 1, \ldots, 100$, were generated by sampling from $N(0, \sigma^2)$
populations with $\sigma^2 = 0.1, 0.5$. The point $\lambda_0$, representing the division
between $H_1$ and $H_2$, was chosen such that the proportion, $G_k(\lambda_0)$, of
$\lambda^{(i)} < \lambda_0$ was successively $0.5, 0.6, \ldots, 0.9$. The optimum (Bayes) cut-off
$\xi_{G_k}$ was determined by finding $x$ satisfying

$$\sum_{i:\,\lambda^{(i)} < \lambda_0} f(x \mid \lambda^{(i)}) \Big/ \sum_{i=1}^{k} f(x \mid \lambda^{(i)}) = \tfrac{1}{2}.$$

Sets of observations $x_1, \ldots, x_k$ were generated repeatedly by
sampling from $N(\lambda^{(i)}, 1)$ populations, $i = 1, \ldots, k$, and, for every set,
estimates $\hat\lambda_1, \ldots, \hat\lambda_s$ were obtained by ML. The actual losses using
$\xi_s$, $\xi_{G_k}$ and $T = \lambda_0$ were found for every set of observations and the
expectation $E_x C(\xi_s)$ of the loss in repeated x-sampling, the $\lambda^{(i)}$
remaining fixed, was estimated by averaging observed values of the
losses. Table 7.2 gives a summary of the estimated values of

$$E_x C(\xi_s), \quad E_x C(\xi_{G_k}) \quad \text{and} \quad E_x C(T).$$

For the larger $G_k(\lambda_0)$ we see that $E_x C(\xi_s)$ is reasonably close to
$E_x C(\xi_{G_k})$, and less than $E_x C(T)$. However, for smaller $G_k(\lambda_0)$, when
$E_x C(\xi_{G_k})$ is close to $E_x C(T)$, it may not be advantageous to use the
smooth CD approach.
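
The determination of the Bayes cut-off described above can be sketched in a few lines. The following is a minimal illustration, not the authors' program: it assumes a 0-1 loss, so that the cut-off is the point at which the posterior probability of $\lambda < \lambda_0$ equals 1/2, and it relies on this probability decreasing in x for normal data. The function names, the random seed, the bisection bracket and the choice $G_k(\lambda_0) = 0.7$, $\sigma^2 = 0.5$ are all assumptions made for the example.

```python
import math
import random


def posterior_h1(x, lambdas, lambda0):
    """Posterior probability of lambda < lambda0 given x, for N(lambda, 1) data
    and equal weight on every member of the finite population."""
    logs = [-0.5 * (x - lam) ** 2 for lam in lambdas]
    shift = max(logs)  # subtract the largest exponent to avoid underflow
    weights = [math.exp(v - shift) for v in logs]
    below = sum(w for w, lam in zip(weights, lambdas) if lam < lambda0)
    return below / sum(weights)


def bayes_cutoff(lambdas, lambda0, lo=-30.0, hi=30.0, iters=200):
    """Bisection for the x at which posterior_h1 equals 1/2.
    The bracket [lo, hi] is an assumption suited to parameters near zero."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if posterior_h1(mid, lambdas, lambda0) > 0.5:
            lo = mid  # H1 still favoured: the cut-off lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical finite population, generated as in the nonsequential examples.
rng = random.Random(0)
k, sigma2, proportion = 100, 0.5, 0.7
lambdas = sorted(rng.gauss(0.0, math.sqrt(sigma2)) for _ in range(k))
n_below = round(proportion * k)
# lambda0 is placed between two order statistics so that G_k(lambda0) = 0.7.
lambda0 = 0.5 * (lambdas[n_below - 1] + lambdas[n_below])

cutoff = bayes_cutoff(lambdas, lambda0)
print("lambda0 =", round(lambda0, 3), " cut-off =", round(cutoff, 3))
```

Decisions are then made by comparing each observation with the cut-off; averaging the resulting losses over repeated sets of observations would give Monte Carlo estimates of the kind reported in Table 7.2.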


Table 7.2 Results for nonsequential CD decision cases with k = 100 and values
$\lambda_1, \ldots, \lambda_k$ obtained by sampling from $N(0, \sigma^2)$ populations; $F(x \mid \lambda) = N(\lambda, 1)$

                  $G_k(\lambda_0)$   $E\{C(\xi_s)\}$    $E\{C(\xi_{G_k})\}$   $E\{C(T)\}$

$\sigma^2 = 0.1$    0.5       0.472 ± 0.004    0.417 ± 0.006    0.417 ± 0.005
                  0.6       0.454 ± 0.009    0.388 ± 0.004    0.417 ± 0.005
                  0.7       0.365 ± 0.015    0.301 ± 0.002    0.406 ± 0.006
                  0.8       0.252 ± 0.015    0.200 ± 0.001    0.378 ± 0.006
                  0.9       0.116 ± 0.006    0.100 ± 0.000    0.339 ± 0.005

$\sigma^2 = 0.5$    0.5       0.353 ± 0.013    0.310 ± 0.007    0.310 ± 0.007
                  0.6       0.314 ± 0.011    0.278 ± 0.008    0.291 ± 0.009
                  0.7       0.279 ± 0.011    0.256 ± 0.007    0.287 ± 0.006
                  0.8       0.234 ± 0.018    0.186 ± 0.006    0.270 ± 0.010
                  0.9       0.105 ± 0.006    0.098 ± 0.001    0.214 ± 0.010
(b) Sequential case

Parameter values were obtained by sampling from (i) a $N(0, \sigma^2)$
population and (ii) a population with the exponential d.f.

$$G(\lambda) = 0 \text{ for } \lambda \leq 0, \qquad G(\lambda) = 1 - e^{-\lambda\theta} \text{ for } \lambda > 0.$$

In theory the $\lambda^{(1)}, \ldots, \lambda^{(k)}$ can be a quite arbitrary collection of
numbers, but the populations which we have used are thought to
represent reasonable practical ‘extremes’.

Because of the lengthy computations required when recalculating $\xi_{s,n}$
for every additional observation, it was decided to adopt a scheme of
‘multiple sampling’ in our examples. Let us suppose that k = nm. Then
a ‘good’ conventional decision rule is used for the first n observations.
After obtaining the next n observations $\xi_{s,2n}$ is computed and used for
the second group of n observations, and so on, until $\xi_{s,mn}$ is used for the


Table 7.3 Results for sequential CD cases with k = 50, values $\lambda_1, \ldots, \lambda_k$
obtained by sampling from distributions (i) $N(0, \sigma^2)$, (ii) $G(\lambda) = 1 - \exp(-\lambda/\sigma)$,
$\lambda > 0$; $\lambda^{(1)}, \ldots, \lambda^{(k)}$ are randomly ordered; $F(x \mid \lambda) = N(\lambda, 1)$

                  $G_k(\lambda_0)$   $E\{C(\text{seq. } \xi_{s,k})\}$   $E\{C(\xi_{G_k})\}$   $E\{C(T)\}$

Distribution (i)
$\sigma^2 = 0.1$    0.5       0.454 ± 0.020    0.412 ± 0.014    0.421 ± 0.013
                  0.6       0.435 ± 0.027    0.382 ± 0.012    0.404 ± 0.015
                  0.7       0.339 ± 0.021    0.297 ± 0.009    0.398 ± 0.016
                  0.8       0.238 ± 0.017    0.198 ± 0.002    0.360 ± 0.017
                  0.9       0.169 ± 0.013    0.100 ± 0.000    0.323 ± 0.013

$\sigma^2 = 0.5$    0.5       0.380 ± 0.014    0.315 ± 0.019    0.315 ± 0.019
                  0.6       0.372 ± 0.017    0.302 ± 0.017    0.316 ± 0.017
                  0.7       0.336 ± 0.016    0.247 ± 0.010    0.307 ± 0.018
                  0.8       0.234 ± 0.016    0.186 ± 0.007    0.281 ± 0.014
                  0.9       0.132 ± 0.015    0.100 ± 0.003    0.251 ± 0.009

Distribution (ii)
$\sigma^2 = 0.1$    0.5       0.454 ± 0.011    0.407 ± 0.013    0.407 ± 0.012
                  0.6       0.481 ± 0.019    0.380 ± 0.009    0.400 ± 0.015
                  0.7       0.424 ± 0.026    0.315 ± 0.005    0.400 ± 0.012
                  0.8       0.279 ± 0.017    0.201 ± 0.001    0.382 ± 0.016
                  0.9       0.184 ± 0.021    0.100 ± 0.000    0.349 ± 0.015

$\sigma^2 = 0.5$    0.5       0.350 ± 0.017    0.326 ± 0.014    0.329 ± 0.015
                  0.6       0.354 ± 0.021    0.287 ± 0.016    0.322 ± 0.017
                  0.7       0.294 ± 0.020    0.235 ± 0.014    0.271 ± 0.016
                  0.8       0.255 ± 0.022    0.169 ± 0.009    0.260 ± 0.016
                  0.9       0.112 ± 0.005    0.088 ± 0.006    0.166 ± 0.011
last n observations. In the light of (7.4.28), this scheme may be
expected to give slightly worse results than the sequential scheme
in which $\xi_{s,n}$ is recalculated with every new observation.

In our examples, populations with k = 50 were generated as
indicated above and we have used n = 10. In case (i) we used $\sigma^2 = 0.1$
and $\sigma^2 = 0.5$, and in case (ii) $1/\theta = 0.1$ and $1/\theta = 0.5$. As before, $\lambda_0$ was
selected to give $G_k(\lambda_0) = 0.5, 0.6, \ldots, 0.9$, respectively. The entries in
Table 7.3 are Monte Carlo estimates of $E_x C(\text{seq. } \xi_{s,k})$ obtained by
repeatedly generating new x-values by adding new random normal
deviates to the fixed $\lambda^{(j)}$. The actual losses were computed for every
new set of observations using, in each case, the new sequence of values
of $\xi_{s,n}$. The estimates given in Table 7.3 are averages of these losses.

From (7.4.28) we see that the greatest differences between
$E_x C(\text{seq. Bayes})$ and $E_x C(\xi_{G_k})$ will occur when the $G_n$ for the
subpopulations of the $\lambda^{(1)}, \ldots, \lambda^{(k)}$ differ most from $G_k$. One obvious way in
which we can produce such differences is to arrange the $\lambda^{(j)}$ in order of
magnitude. The results of Table 7.4 were obtained on this basis,
exactly the same $\lambda^{(j)}$ being used as before. Comparison of Tables 7.3
and 7.4 shows that the values of $E_x C(\cdot)$ for the non-random sequences
of $\lambda^{(j)}$ tend to be lower. The trend is more noticeable for the large
values of the variance of the population of $\lambda^{(j)}$, as one may expect. The
importance of these results from the point of view of application of
smooth sequential CD procedures in problems of acceptance sampling,
when there may be systematic variation in quality, is obvious.

Table 7.4 Results for sequential CD cases with k = 50, values $\lambda_1, \ldots, \lambda_k$
obtained by sampling from distributions (i) $N(0, \sigma^2)$, (ii) $G(\lambda) = 1 - \exp(-\lambda/\sigma)$,
$\lambda \geq 0$; $\lambda^{(1)}, \ldots, \lambda^{(k)}$ arranged in increasing order of magnitude;
$F(x \mid \lambda) = N(\lambda, 1)$

                  Distribution (i)                      Distribution (ii)
                  $G_k(\lambda_0)$   $E\{C(\text{seq. } \xi_{s,k})\}$      $G_k(\lambda_0)$   $E\{C(\text{seq. } \xi_{s,k})\}$

$\sigma^2 = 0.1$    0.5       0.336 ± 0.026             0.5       0.421 ± 0.035
                  0.6       0.319 ± 0.023             0.6       0.399 ± 0.038
                  0.7       0.338 ± 0.015             0.7       0.348 ± 0.024
                  0.8       0.268 ± 0.013             0.8       0.279 ± 0.012
                  0.9       0.179 ± 0.018             0.9       0.186 ± 0.018

$\sigma^2 = 0.5$    0.5       0.280 ± 0.027             0.5       0.310 ± 0.024
                  0.6       0.267 ± 0.027             0.6       0.271 ± 0.023
                  0.7       0.249 ± 0.015             0.7       0.217 ± 0.014
                  0.8       0.190 ± 0.015             0.8       0.138 ± 0.012
                  0.9       0.116 ± 0.009             0.9       0.100 ± 0.010
7.5 General discussion

While the different approaches mentioned above make use of different 
probability models, each of them treats the same data set, i.e. that 
generated in the EB sampling scheme. They can be placed in a 
spectrum of inferential techniques for data of this sort. At one extreme 
there is the purely frequentist approach represented by compound 
decision theory. It can be regarded as the most flexible, assuming no 
prior distribution of the parameters. The only requirement is 
knowledge of component loss functions and the convention of taking 
their average as a reasonable compound loss for simultaneous 
decision making. At the other end of the spectrum is the full Bayesian 
(FB) treatment of the EB scheme. Here a special form of prior
distribution is assumed not only for the unobserved $\lambda$ in the second
stage of the EB scheme, but also for the hyperparameters of the prior
distribution of $\Lambda$. Exact FB solutions are generally complicated and
seem to be not directly useful for practical applications. However, an 
approximate general solution is available which obviates the need to 
specify the hyper-prior distribution explicitly. This approximate 
solution turns out to be formally the same as the EB solution. A 
second approximate solution that has been used, especially in the 
linear model situation, employs a non-informative hyper prior. The 
usual problems associated with choosing non-informative priors 
remain. 

The EB method and other alternatives to it may be regarded as 
falling somewhere in the middle of the spectrum, being essentially 
compromises between the two extremes. The two likelihood based 
methods, modified maximum likelihood (MML), and empirical 
regression estimation (ERE), are very similar to each other, the latter 
being historically important because of its early appearance. These 
two methods do not employ loss functions and they are not decision 
theoretic in nature. They can be unified as special cases of a more 
general method which uses a summary function of the posterior 
distribution of $\Lambda$. They do not assume a distribution for the
hyperparameters, and produce results which are virtually the same as 
the corresponding EB solutions. Indeed, the inferential EB approach 
which does not use loss functions is practically identical to these 
likelihood-based methods. 

While the CD approach is attractive from the frequentist point of
view, some statisticians have difficulty in accepting it because detailed
studies have hinged on the risk convergence property, which is not
seen as having a natural practical relevance. The CD rules are found
to be formally identical to EB rules, the main difference being in the
criteria on which their performance is judged. Using a CD rule for a
particular decision implies using in it observations whose distributions
do not depend on the parameter of that problem. This latter
aspect is unsatisfactory from the viewpoint of likelihood inference and
the notion of sufficiency; see Copas (1969).

The crucial notion in the FB approach is exchangeability in the 
parameter sequence. It has been argued that in many simultaneous 
decision problems the parameters could be exchangeable in that the 
prior opinion of any particular parameter is the same as that for any 
other member of the sequence. Lindley and Smith (1972) exploited the
fact that one way of ensuring an exchangeable distribution for $\lambda$ is to
take it to be of the form (1.14.9). The distributions $G(\lambda \mid \phi)$ and that
of $\phi$ are arbitrarily chosen. Initial developments deal mainly with normal
data, prior and hyper-prior distributions, but wider applicability has
been demonstrated. The exponential data distribution was treated by
Deely and Lindley (1981) who showed also that the usual EB
solutions with parametric assumptions for $G(\lambda \mid \phi)$ can be regarded as
first-order approximations to the FB solutions. The possibility of
second-order solutions has been indicated. So far no comparative 
study has been made of EB and FB solutions in terms of the average 
risk that is commonly used in the assessment of EB rules. 

Finally we reconsider briefly the circumstances in which one may
seriously regard an EB approach as a competitor to more conventional
analyses. The simplest case of single observations $x_i$ on
normally distributed random variables occurring at realizations $\lambda_i$ of
$\Lambda$, $i = 1, 2, \ldots, n$, gives a guide. If var$(X_i \mid \lambda_i) = \sigma^2$ and var$(\Lambda) = \sigma_G^2$, the
arguments leading to (7.4.17) indicate that one needs $\sigma^2 > \sigma_G^2$. Clearly,
if $\sigma_G^2$ is zero, or close to zero, pooling of the data and estimating every
$\lambda_i$ by the grand mean is indicated. On the other hand, if the prior
distribution is very diffuse little can be gained by a Bayesian
approach. Aside from these considerations there is also the matter of
the amount of previous data available for constructing the EB rule.
This question has been dealt with in detail at various points.
Essentially, the EB rules may not be good unless they are
reasonably accurate estimates of the Bayes rules.
Applications of EB methods


8.1 Introduction 

In this chapter we present summaries of some of the applications of 
EB methods that have been published. In most branches of statistics 
there are data sets that are used repeatedly to demonstrate new
techniques or modifications of old ones. The reason is that those data
sets are considered to be particularly suitable in the sense of being 
generated by processes closely approximating the models on which 
the techniques are based. The same is true of EB methods. 

Data from non-trivial practical applications seldom follow the 
somewhat idealized patterns assumed in the development of techniques.
Here again, the EB approach is no exception. The EB sampling
schemes which have been studied in much detail in previous chapters, 
and in a large body of other EB literature, are relatively simple, and 
few actual data sets follow those patterns exactly. This does not, of 
course, make the study of those simpler schemes useless. The results of 
those studies, while not necessarily providing answers in particular 
practical problems, do give valuable guides as to the potential 
usefulness of the methods, and in this sense, provide qualitative 
answers. 

The examples which have been selected show that practical 
situations approximating the EB sampling scheme do arise, and they 
are individually interesting because they illustrate aspects of the 
theory. Morris (1983b) surveys some important applications of EB 
methods; some of the examples quoted by Morris are also given in 
some detail in this chapter. Chapter 2 deals with mixtures, a large 
topic in its own right, and we could have included many examples on 
mixtures. By and large, only examples with a clear empirical Bayes 
connotation have been included, i.e. examples in which decisions 
about individual components are of interest. Take as an illustration 
the case of growth curve analysis of which an example is included. If a 
study is aimed at searching for or quantifying differences between
groups of subjects it is not clear that EB methods are relevant. On the 
other hand, individual estimates may be important, for instance if 
they are to form part of the basis for clinical advice; the example of 
Berkey (1982) is a case in point, and others are easily identified. 

We have selected several examples showing original data sets in 
enough detail to make possible the actual calculations associated 
with EB estimation. They have been selected partly for being of 
manageable size, for illustrating points of methodology, or for being 
intrinsically interesting. Some good examples are quoted without 
data sets, either because they are not readily obtainable or because 
they are simply too large for inclusion. 

In many practical cases there will be concomitant information
about the parameter values. Specifically, recall the EB sampling
scheme where we have observations $(x_1, x_2, \ldots, x_n)$ when the parameter
values are $(\lambda_1, \lambda_2, \ldots, \lambda_n)$. Every $x_i$ is usually thought of as an
estimate of the corresponding $\lambda_i$. Now it may happen that we also
have associated with every $x_i$ an observation $c_i$ on a concomitant
variable C. Every $c_i$ is not necessarily an estimate of $\lambda_i$ but C and $\Lambda$
may not be independent, so that taking account of the observed c
should improve the estimate of $\lambda$. However, the emphasis is still on
estimating individual $\lambda$ values, and not on exploring the relationship
between $\Lambda$ and C, for instance, through the regression of $\Lambda$ on C. The
paper by Tsutakawa, Shoop and Marienfeld (1985) pays some 
attention to concomitant variables in the EB context. Other examples 
involving concomitant variables are discussed by Raudenbush and 
Bryk (1985), and Fay and Herriot (1979). Some details of the use of 
concomitant information are given in section 4.7. 


8.2 Examples with normal data distributions 

8.2.1 Law school validity studies 

Most of the material of this section paraphrases a paper by Rubin 
(1980) which treats some aspects of data supplied by the Educational 
Testing Service (ETS) in the USA in yearly reports to law schools 
participating in a selection procedure. This analysis concentrates on 
the problem of predicting the first-year grade average in law school 
(FYA) for applicant students from their Law School Aptitude Test 
(LSAT) scores and their undergraduate grade point averages 
(UGPA). The LSAT is administered and graded by ETS, and its
scores range from 200 to 800. UGPA scores range from 1.0 to 4.0. In
Rubin’s analysis the UGPA scores were multiplied by 200 to give 
them the same range as the LSAT scores. In what follows UGPA is to 
be taken as the original score multiplied by 200. 

One important objective of the ETS reports is to predict FYA by a 
linear rule 

$$\text{FYA} \propto \text{LSAT} + M \times \text{UGPA}$$

and the question for each law school is how to choose the multiplier 
M in this predictor. One way is for each law school to perform a least 
squares regression of FYA on LSAT and UGPA and to use the ratio 
of regression coefficients for M. In Rubin’s analysis of 1973 data, in 
order to predict 1974 FYA, this was the starting point. Data from 82
law schools were used in this study, the results from these schools 
being treated like the typical sequence of ‘past’ results in the EB 
sampling scheme. Every individual result in turn is then treated as the 
‘current’ observation in the terminology of earlier chapters. 

The basic data used in the EB analysis comprise two least squares
regression coefficients together with an estimated covariance matrix
for each law school. As indicated above, the initial estimate of the
multiplier $M_i$ for school i is taken to be the ratio $r_i$ of these two
coefficients. For the EB analysis Rubin chose to work with
$a_i = \arctan(r_i)$ because its distribution is thought to be closer to
normal than that of $r_i$, normality being an underlying assumption in
the analysis. By the standard delta method an estimated variance, $s_i^2$,
is calculated for each $a_i$. These variances differ from school to school.
Taking them to be the actual variances we have a situation which is
equivalent to the case of a normal data distribution but variable m,
discussed in section 3.9.3. Table 8.1 has been extracted from two
tables in Rubin (1980). It shows the least squares estimate of $M_i$, $a_i$, $s_i$
for 1973 in columns (2)-(4). Note that, in Rubin’s notation, least
squares $M_i = 200/r_i$. Column (5) shows the EB estimate of $M_i$ for each
school.

Calculation of the EB estimates is according to the following
theory. The distribution of $a_i$ for given $\mu_i$ is $N(\mu_i, s_i^2)$, and the $\mu_i$ values
are taken to be sampled from a $N(\mu_*, \sigma_*^2)$. Without giving the details
here, it should be mentioned that Rubin performed some careful
preliminary analyses of the data in order to establish that this model is
reasonable. The Bayes estimate of $\mu_i$ is $\lambda_i a_i + (1 - \lambda_i)\mu_*$, where $\lambda_i =$
Table 8.1 Data used in law school validity studies including least squares and
EB estimates of multipliers $M_i$.

(1) = School ID   (2) = Multiplier = $M_i$   (3) = $a_i = \arctan(200/M_i)$
(4) = s.d.$(a_i) = s_i$   (5) = EBE of $M_i$

(1)    (2)     (3)     (4)    (5)        (1)    (2)     (3)     (4)    (5)

  1    102   1.099   0.148    116         42     76   1.209   0.115     99
  2    217   0.745   0.201    146         43     74   1.216   0.110     97
  3    211   0.759   0.152    154         44    157   0.906   0.104    143
  4     70   1.234   0.159    105         45     96   1.123   0.151    114
  5     82   1.184   0.143    107         46    132   0.987   0.119    129
  6    149   0.930   0.099    139         47     81   1.188   0.108    100
  7   2507   0.080   0.372    151         48    212   0.757   0.176    149
  8    119   1.032   0.157    124         49    142   0.954   0.111    134
  9    175   0.853   0.172    140         50    114   1.054   0.195    123
 10    266   0.645   0.171    161         51    151   0.924   0.182    133
 11    179   0.842   0.119    150         52    124   1.016   0.135    125
 12    153   0.918   0.138    137         53    123   1.021   0.097    124
 13    125   1.012   0.189    126         54    133   0.983   0.114    130
 14    127   1.004   0.158    127         55    199   0.787   0.128    156
 15    125   1.014   0.211    126         56    158   0.901   0.186    135
 16    136   0.974   0.128    131         57    105   1.088   0.173    119
 17    136   0.973   0.098    132         58    155   0.912   0.094    143
 18    111   1.066   0.173    121         59    163   0.886   0.146    140
 19    222   0.733   0.181    150         60    132   0.987   0.174    128
 20     82   1.182   0.147    108         61     88   1.158   0.166    113
 21     84   1.174   0.108    102         62    151   0.924   0.084    142
 22     89   1.152   0.122    108         63    126   1.008   0.138    126
 23     89   1.152   0.127    108         64    -24   1.697   0.286    100
 24    244   0.687   0.171    157         65    168   0.873   0.259    133
 25    152   0.922   0.094    141         66     81   1.187   0.101     99
 26     81   1.186   0.077     94         67    132   0.989   0.125    129
 27     99   1.111   0.120    112         68    132   0.989   0.132    129
 28    104   1.089   0.265    122         69    124   1.016   0.094    125
 29     62   1.268   0.194    107         70    202   0.782   0.128    157
 30    100   1.107   0.148    116         71    179   0.841   0.197    139
 31    158   0.902   0.115    142         72     78   1.198   0.139    105
 32    306   0.579   0.076    234         73    198   0.790   0.104    163
 33    142   0.955   0.192    130         74    125   1.013   0.088    125
 34    115   1.048   0.146    122         75    254   0.666   0.114    182
 35     31   1.419   0.124     77         76    152   0.922   0.150    136
 36     40   1.375   0.167     94         77    150   0.926   0.105    139
 37     94   1.130   0.103    108         78    173   0.858   0.178    139
 38     92   1.138   0.146    112         79    140   0.959   0.116    133
 39    130   0.995   0.131    128         80     69   1.239   0.161    105
 40     68   1.243   0.126     97         81    141   0.956   0.314    128
 41     89   1.151   0.104    105         82     72   1.227   0.119     98
$(1 + s_i^2/\sigma_*^2)^{-1}$. The EB estimate is obtained on replacing $\mu_*$ and
$\sigma_*^2$ in these expressions by their estimates as calculated from
the data. Rubin estimates $\mu_*$ and $\sigma_*^2$ by the method of maximum
likelihood, i.e. maximizing

$$\prod_{i=1}^{82} [2\pi(s_i^2 + \sigma_*^2)]^{-1/2} \exp[-\tfrac{1}{2}(a_i - \mu_*)^2/(s_i^2 + \sigma_*^2)]$$

w.r.t. $\mu_*$ and $\sigma_*^2$. The actual method of calculating the estimates was
by using the EM algorithm. See also sections 2.8.2 and 2.9. For the
1973 data shown in Table 8.1 the estimates of $\mu_*$ and $\sigma_*^2$ are,
respectively, 1.008 and 0.0139. For school No. 1 we have $a_1 = 1.099$,
$s_1 = 0.148$, giving the EB estimate of $\mu_1 = 1.0433$, and the EB estimate
of $M_1 = 200/\tan(1.0433) = 116.5$.

As Rubin points out, characteristically of EB estimates, the estimates
in column (5) do not fluctuate as wildly as those in column (2).
Rubin also reports on a validation study of the estimates by looking at
the predictions of 1974 performances. It turns out that the EB
estimates do perform better according to this important separate
criterion.
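
The calculation for school No. 1 can be retraced as follows. This is only a sketch of the arithmetic, using the ML estimates and the Table 8.1 entries quoted above; the variable names are chosen for the illustration and are not Rubin's.

```python
import math

# ML estimates of the prior mean and variance, as quoted in the text.
mu_star, sigma2_star = 1.008, 0.0139

# School No. 1 in Table 8.1: a_1 = arctan(200 / M_1) and its standard deviation.
a1, s1 = 1.099, 0.148

lam1 = 1.0 / (1.0 + s1 ** 2 / sigma2_star)    # weight given to the school's own a_1
mu1_eb = lam1 * a1 + (1.0 - lam1) * mu_star   # EB estimate on the arctan scale
m1_eb = 200.0 / math.tan(mu1_eb)              # back to the multiplier scale

print(round(lam1, 3), round(mu1_eb, 4), round(m1_eb, 1))
```

The weight on the school's own estimate is below one half here, which is why the least squares multiplier of 102 is pulled well towards the centre of the other schools.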

To conclude this section we remark that the estimation of $\mu_*$ and $\sigma_*^2$
could have been done by the method of moments, in the manner of the
discussions of section 3.9. These estimates are given by

$$(1/82) \sum_i (a_i - \hat\mu_*)^2 (\hat\sigma_*^2 + s_i^2)^{-1} = 1,$$
$$\hat\mu_* = \sum_i a_i (\hat\sigma_*^2 + s_i^2)^{-1} \Big/ \sum_i (\hat\sigma_*^2 + s_i^2)^{-1}.$$

The results calculated from the 1973 data are $\hat\mu_* = 1.008$, $\hat\sigma_*^2 = 0.0136$.
They are quite close to the ML estimates. Since the method of
moments estimate of $\sigma_*^2$ is greater than the ML estimate, the
corresponding EB estimates are somewhat more variable, but they
are still far less variable than the least squares estimates of the
multipliers.
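
One way of solving a pair of moment equations of this form is by bisection on the prior variance, recomputing the weighted mean at every trial value. The sketch below is an assumption about how such a calculation might be organised, not the procedure of section 3.9; the function names are hypothetical, and for brevity only the first six schools of Table 8.1 are used, so the output is illustrative and is not the estimate for the full data.

```python
def weighted_mean(a, s2, sigma2):
    """Precision-weighted mean with weights 1 / (sigma2 + s_i^2)."""
    w = [1.0 / (sigma2 + v) for v in s2]
    return sum(wi * ai for wi, ai in zip(w, a)) / sum(w)


def moment_stat(a, s2, sigma2):
    """Sum of squared standardized deviations about the weighted mean."""
    mu = weighted_mean(a, s2, sigma2)
    return sum((ai - mu) ** 2 / (sigma2 + v) for ai, v in zip(a, s2))


def moment_estimates(a, s2, divisor, iters=200):
    """Find sigma2 with moment_stat / divisor = 1; returns (mu_hat, sigma2_hat).
    If the statistic is already below the divisor at sigma2 = 0, the variance
    estimate is set to zero."""
    if moment_stat(a, s2, 0.0) <= divisor:
        return weighted_mean(a, s2, 0.0), 0.0
    lo, hi = 0.0, 1.0
    while moment_stat(a, s2, hi) > divisor:   # expand until the root is bracketed
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if moment_stat(a, s2, mid) > divisor:
            lo = mid
        else:
            hi = mid
    sigma2 = 0.5 * (lo + hi)
    return weighted_mean(a, s2, sigma2), sigma2


# First six schools of Table 8.1 (columns 3 and 4), for illustration only.
a = [1.099, 0.745, 0.759, 1.234, 1.184, 0.930]
s = [0.148, 0.201, 0.152, 0.159, 0.143, 0.099]
s2 = [v * v for v in s]

mu_hat, sigma2_hat = moment_estimates(a, s2, divisor=len(a))
print(round(mu_hat, 4), round(sigma2_hat, 4))
```

The divisor is passed explicitly so that either the number of schools, as in the equation above, or a degrees-of-freedom correction can be used.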

8.2.2 Fitting of growth curves 

Many studies to do with the fitting of growth curves concern human 
subjects on whom a response variable Y is observed at a succession of 
age points. If, for example, Y is the height of a child, a plot of Y against 
age produces a set of points through which one might fit a growth 
curve. Obviously the associated methodology need not be restricted 
in application to data of this sort, but the terminology is convenient. 
A common type of model in growth curve analysis is that the
observed value of the response variable at time t is $y(t) = g(\theta) + e$,
where e is taken to be a normally distributed random error,
independent at every observation. Berkey (1982) reports a growth
study in which the model $g(\theta)$ is the Jenss curve,
$y(t) = \alpha_0 + \alpha_1 t - \exp(\beta_0 + \beta_1 t)$, thought to be suitable for describing
the growth of young children. Here the parameter
$\theta = (\alpha_0, \alpha_1, \beta_0, \beta_1)$ is a 4-dimensional vector whose prior distribution is
assumed to be $N(\mu_G, \Sigma_G)$. The essentially empirical Bayesian part of Berkey’s
analysis has to do with estimation of the parameters $\mu_G$ and $\Sigma_G$ from
the data on 218 children. On each of these children there were
between 11 and 14 $(y, t)$ observations from which least squares
estimates of the elements of $\theta$ were calculated, yielding 218 of the
4-dimensional vectors of estimates $\hat\theta_i$, $i = 1, 2, \ldots, 218$. Taking these
estimates to be individually unbiased for their respective parameters,
the mean vector $\mu_G$ is estimated by the mean of the $\hat\theta_i$ values. The
matrix $\Sigma_G$ can be estimated by first obtaining the sample covariance
matrix V of the observed least squares estimates. There is a
conditional covariance matrix of the least squares estimates for each
subject. The average of these matrices is subtracted from V to give an
estimate of $\Sigma_G$. This was the procedure followed in Berkey (1982). In
practice the constraint that the estimate $\hat\Sigma_G$ should be nonnegative
definite may have to be imposed.

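For Berkey’s data we have

$$\hat\Sigma_G = \begin{pmatrix}
24.199 & -1.840 & 0.620 & 0.822 \\
-1.840 & 0.595 & -0.059 & -0.092 \\
0.620 & -0.059 & 0.024 & 0.021 \\
0.822 & -0.092 & 0.021 & 0.058
\end{pmatrix},$$

$$\bar\Sigma = \begin{pmatrix}
2.602 & -0.486 & 0.065 & 0.181 \\
-0.486 & 0.093 & -0.012 & -0.032 \\
0.065 & -0.012 & 0.0025 & 0.0021 \\
0.181 & -0.032 & 0.0021 & 0.0126
\end{pmatrix},$$

$$\hat\mu_G = (77.7785, 6.4214, 3.2550, -0.9919)'.$$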

Berkey reports the least squares estimates for six of the children;
one of these, child number 083, has the following values:

(72.746, 8.412, 3.033, -1.216).

Unfortunately the matrix $\Sigma_i$ for this subject is not given, so a matrix
close to $\bar\Sigma$ was used in a calculation of the EB estimate for this subject.
The matrix actually used was $\bar\Sigma$ with the diagonal elements replaced
by (3.00, 0.11, 0.0030, 0.0160). This substitution was made largely
to produce a positive definite matrix. According to formula (4.4.1),
Chapter 4,

$$\text{EB estimate} = (\Sigma_i^{-1} + \hat\Sigma_G^{-1})^{-1} (\Sigma_i^{-1} \hat\theta_i + \hat\Sigma_G^{-1} \hat\mu_G).$$

The result is

(73.880, 8.136, 3.072, -1.159),

which differs somewhat from the result given by Berkey, not only
because of the use of $\bar\Sigma$ instead of $\Sigma_i$ but also because Berkey’s method
of obtaining an EB estimate is to obtain a posterior mode in the
manner of Lindley and Smith (1972).
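
The calculation for child 083 can be retraced with a small linear solve. The sketch below uses the algebraically equivalent form $\hat\theta_i - \Sigma_i(\hat\Sigma_G + \Sigma_i)^{-1}(\hat\theta_i - \hat\mu_G)$ of the estimate, which avoids inverting the nearly singular $\Sigma_i$. It assumes that the first of the two matrices printed above is the estimate of $\Sigma_G$ and that the second, with its diagonal replaced as described, stands in for $\Sigma_i$; the helper functions are written for this illustration only. Because the printed matrix entries are rounded, the output should be expected to agree with the quoted result only approximately.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z


def eb_estimate(theta_hat, mu_g, sigma_g, sigma_i):
    """theta_hat - Sigma_i (Sigma_G + Sigma_i)^{-1} (theta_hat - mu_G)."""
    n = len(theta_hat)
    total = [[sigma_g[r][c] + sigma_i[r][c] for c in range(n)] for r in range(n)]
    z = solve(total, [t - m for t, m in zip(theta_hat, mu_g)])
    return [theta_hat[r] - sum(sigma_i[r][c] * z[c] for c in range(n))
            for r in range(n)]


# Assumed to be the estimate of Sigma_G (first matrix in the text).
sigma_g = [[24.199, -1.840, 0.620, 0.822],
           [-1.840, 0.595, -0.059, -0.092],
           [0.620, -0.059, 0.024, 0.021],
           [0.822, -0.092, 0.021, 0.058]]
# Average conditional covariance with the diagonal replaced as in the text.
sigma_i = [[3.00, -0.486, 0.065, 0.181],
           [-0.486, 0.11, -0.012, -0.032],
           [0.065, -0.012, 0.0030, 0.0021],
           [0.181, -0.032, 0.0021, 0.0160]]
mu_g = [77.7785, 6.4214, 3.2550, -0.9919]
theta_hat = [72.746, 8.412, 3.033, -1.216]   # least squares estimates, child 083

eb = eb_estimate(theta_hat, mu_g, sigma_g, sigma_i)
print([round(v, 3) for v in eb])
```

Setting $\Sigma_i$ to zero returns the least squares estimates unchanged, which is a convenient check on the implementation.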

8.2.3 Predicting baseball batting averages 

The second column of Table 8.2 gives the batting averages of 18
players over their first m = 45 ‘at bats’ in the 1970 season. Multiplying
these averages by 45 gives the integer scores in the third column; call
these $Z_i$. These results are extracted from Efron and Morris (1975)
according to whom the $Z_i = mY_i$ can be regarded as independent
binomial $(m, p_i)$ random variables, $i = 1, 2, \ldots, k$. In order to use
previously developed theory connected with the Stein estimator (see
for example James and Stein, 1961), Efron and Morris work with

$$X_i = f_m(Y_i) = m^{1/2} \arcsin(2Y_i - 1),$$

because the distribution of $X_i$ can be taken as approximately normal
with mean $\theta_i = f_m(p_i)$ and unit variance. Developing an EB estimator
of $\theta_i$, the $\theta_i$ values are assumed to be independently generated by a
$N(\mu, \tau^2)$ distribution. The Bayes estimator for $\theta_i$ can be written as

$$\mu + \{1 - (1 + \tau^2)^{-1}\}(X_i - \mu)$$

and an EB version is obtained when estimating $\mu$ by $\bar{x}$ and $1/(1 + \tau^2)$
by $(k - 3)/V$, where $V = \sum (x_i - \bar{x})^2$. From the data in Table 8.2 the
values of $\bar{x}$ and $(k - 3)/V$ are -3.275 and 0.791, giving the EB estimate

$$\delta^1(x_i) = 0.791\,\bar{x} + 0.209\,x_i = 0.209\,x_i - 2.59.$$

For the purpose of this exercise the values in columns (5) and (6) are
Table 8.2 Column descriptions: (1) Player number (2) Batting average for
first 45 at bats = $y_i$ (3) $z_i = 45y_i$ (4) $x_i = f_m(y_i)$ (5) Batting average for the
remainder of season = $p_i$ (6) $\theta_i = f_m(p_i)$ (7) $\delta^1(x_i)$ (8) retransformed $\delta^1(x_i)$
(9) binomial EBE of $p_i$

(1)    (2)    (3)     (4)     (5)     (6)     (7)     (8)     (9)

  1   0.400    18   -1.35   0.346   -2.10   -2.87   0.292   0.273
  2   0.378    17   -1.66   0.298   -2.79   -2.94   0.288   0.272
  3   0.356    16   -1.97   0.276   -3.11   -3.00   0.284   0.270
  4   0.333    15   -2.28   0.222   -3.96   -3.07   0.279   0.269
  5   0.311    14   -2.60   0.273   -3.17   -3.13   0.275   0.268
  6   0.311    14   -2.60   0.270   -3.20   -3.13   0.275   0.268
  7   0.289    13   -2.92   0.263   -3.32   -3.20   0.270   0.267
  8   0.267    12   -3.26   0.210   -4.15   -3.27   0.266   0.266
  9   0.244    11   -3.60   0.269   -3.23   -3.34   0.261   0.264
 10   0.244    11   -3.60   0.230   -3.83   -3.34   0.261   0.264
 11   0.222    10   -3.95   0.264   -3.30   -3.42   0.256   0.263
 12   0.222    10   -3.95   0.256   -3.43   -3.42   0.256   0.263
 13   0.222    10   -3.95   0.303   -2.71   -3.42   0.256   0.263
 14   0.222    10   -3.95   0.264   -3.30   -3.42   0.256   0.263
 15   0.222    10   -3.95   0.226   -3.89   -3.42   0.256   0.263
 16   0.200     9   -4.32   0.285   -2.98   -3.49   0.251   0.262
 17   0.178     8   -4.70   0.316   -2.53   -3.57   0.246   0.261
 18   0.156     7   -5.10   0.200   -4.32   -3.66   0.241   0.259


taken as the parameter values for individual batters, and these are
being estimated using the results in columns (2) to (4). Column (7)
gives the EBE $\delta^1(x_i)$ of $\theta_i$, column (8) the retransformed $\delta^1(x_i)$, i.e. an
EBE of $p_i$. The values in column (8) differ in the third decimal place
from the corresponding results in Table 2 of Efron and Morris; the
differences are due to rounding at stages of the calculation.

The ‘binomial EBE’ values in column (9) were obtained by applying
the linear EB technique of section 3.7.2 to the results in column (3).
With the formulae in section 3.7.2(c), the linear EBE is found as
$\hat{p}_i = 0.2508 + 0.001226\,z_i$.

The ML estimates of the $p_i$ are the values shown in column (2). The
sums of squared differences between actual and estimated $p_i$ values
are:

maximum likelihood        : 0.0754
retransformed $\delta^1$       : 0.0214
binomial EBE              : 0.0228


Both EBEs perform notably better than the MLE in this example. 
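
The comparison can be retraced from columns (3) and (5) of Table 8.2. The sketch below is an illustration rather than the original computation: it recomputes the shrinkage factor $(k - 3)/V$ from the transformed data, then applies the two estimating rules exactly as quoted in the text, $0.209\,x_i - 2.59$ and $0.2508 + 0.001226\,z_i$, instead of re-deriving their coefficients. The function names are chosen for the example.

```python
import math

m = 45
# Column (3): hits in the first 45 at bats; column (5): rest-of-season averages.
z = [18, 17, 16, 15, 14, 14, 13, 12, 11, 11, 10, 10, 10, 10, 10, 9, 8, 7]
p_rest = [0.346, 0.298, 0.276, 0.222, 0.273, 0.270, 0.263, 0.210, 0.269,
          0.230, 0.264, 0.256, 0.303, 0.264, 0.226, 0.285, 0.316, 0.200]
k = len(z)


def f_m(y):
    """Variance-stabilizing transformation sqrt(m) * arcsin(2y - 1)."""
    return math.sqrt(m) * math.asin(2.0 * y - 1.0)


def f_m_inverse(x):
    """Back-transformation from the arcsin scale to a proportion."""
    return 0.5 * (math.sin(x / math.sqrt(m)) + 1.0)


def sse(estimates):
    """Sum of squared differences from the rest-of-season averages."""
    return sum((e - p) ** 2 for e, p in zip(estimates, p_rest))


y = [zi / m for zi in z]                  # ML estimates, column (2)
x = [f_m(yi) for yi in y]                 # column (4)
mean_x = sum(x) / k
V = sum((xi - mean_x) ** 2 for xi in x)
shrink = (k - 3) / V                      # estimate of 1 / (1 + tau^2)

delta = [0.209 * xi - 2.59 for xi in x]   # EB rule as quoted in the text
p_eb = [f_m_inverse(d) for d in delta]    # column (8)
p_lin = [0.2508 + 0.001226 * zi for zi in z]   # column (9), rule as quoted

print("shrinkage factor:", round(shrink, 3))
print("SSE  ML:", round(sse(y), 4),
      " retransformed EB:", round(sse(p_eb), 4),
      " binomial EBE:", round(sse(p_lin), 4))
```

The sums of squares obtained this way may differ from the tabulated figures in the last decimal place, since the table entries were rounded at intermediate stages.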
8.2.4 Teacher expectancy and pupil IQ

Raudenbush and Bryk (1985) examine the results of n = 19 independent
studies in each of which the effect of teacher expectancy on pupil
IQ was estimated by an ‘effect size’, $d_i$, being the mean difference
between experimental and control children divided by a standard
deviation pooled within groups. Table 8.3 gives values of $d_i$, a
standard error $\hat\sigma_i$ of each $d_i$ and other data. In the notation of
section 8.1, $\hat\sigma_i^2 = \hat\sigma^2/m_i$.

In the analysis given by Raudenbush and Bryk the distribution of
every $d_i$ given the ‘true’ $\delta_i$ is assumed normal with variance $v_i$. In the
calculations $v_i$ is taken to be equal to the square of the standard error
shown in the last column of Table 8.3. The prior distribution from
which the $\delta_i$ are sampled is assumed normal, $N(\mu_G, \sigma_G^2)$.

Following Rubin (1980) one can estimate $\mu_G$ and $\sigma_G^2$ by the method
of maximum likelihood, using the fact that the marginal distribution
of $d_i$ is normal $N(\mu_G, \sigma_G^2 + v_i)$. Then the MLE of $\sigma_G^2$ is given by $\hat\sigma_G^2$


Table 8.3 Results of experiments assessing the effect of teacher expectancy on
pupil IQ

                                      Weeks of        Coded     Effect size   Standard
Study                                 prior contact   $c_i$       $d_i$           error

 1. Rosenthal et al. (1974)                 2           2         0.03        0.125
 2. Conn et al. (1968)                     21           3         0.12        0.147
 3. Jose & Cody (1971)                     19           3        -0.14        0.167
 4. Pellegrini & Hicks (1972)               0           0         1.18        0.373
 5. Pellegrini & Hicks (1972)               0           0         0.26        0.369
 6. Evans & Rosenthal (1968)                3           3        -0.06        0.103
 7. Fielder et al. (1971)                  17           3        -0.02        0.103
 8. Claiborn (1969)                        24           3        -0.32        0.220
 9. Kester (1969)                           0           0         0.27        0.164
10. Maxwell (1970)                          1           1         0.80        0.251
11. Carter (1970)                           0           0         0.54        0.302
12. Flowers (1966)                          0           0         0.18        0.223
13. Keshock (1970)                          1           1        -0.02        0.289
14. Henrikson (1970)                        2           2         0.23        0.290
15. Fine (1972)                            17           3        -0.18        0.159
16. Greiger (1970)                          5           3        -0.06        0.167
17. Rosenthal & Jacobson (1968)             1           1         0.30        0.139
18. Fleming & Anttonen (1971)               2           2         0.07        0.094
19. Ginsburg (1970)                         7           3        -0.07        0.174
satisfying

$$\sum_{i=1}^{n} \frac{(d_i - \hat\mu_G)^2}{(\hat\sigma_G^2 + v_i)^2} = \sum_{i=1}^{n} \frac{1}{\hat\sigma_G^2 + v_i},$$

and $\hat\mu_G = \sum_i d_i(\hat\sigma_G^2 + v_i)^{-1} \big/ \sum_i (\hat\sigma_G^2 + v_i)^{-1}$. The solution has to be found
iteratively, for example by starting with a trial value of $\hat\sigma_G^2$, calculating
the trial $\hat\mu_G$, and continuing in this manner. Alternatively the EM
algorithm can be used as described in Rubin (1980).

The results of such calculations using the data in Table 8.3 are
$\hat\mu_G = 0.1011$, $\hat\sigma_G^2 = 0.0456$. Individual EB estimates are given by
$(d_i\hat\sigma_G^2 + \hat\mu_G v_i)/(\hat\sigma_G^2 + v_i)$; for example, the EBE for study 4 in the
table becomes 0.3674.

A linear Bayes approach can be adopted, without the assumptions
of normality, as outlined in sections 1.12 and 4.6. Since marginally we
have $E(d_i - \mu_G)^2/(\sigma_G^2 + v_i) = 1$, we can estimate $\sigma_G^2$ by solving

$$\sum_{i=1}^{n} (d_i - \hat\mu_G)^2/(\hat\sigma_G^2 + v_i) = n,$$

with $\hat\mu_G$ defined as in the case of ML estimation. This alternative
approach gives $\hat\mu_G = 0.0777$, $\hat\sigma_G^2 = 0.0125$, and the EBE for study 4 as
0.1685.
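The two EBEs quoted for study 4 can be checked directly from the shrinkage formula. The following is a minimal sketch, assuming only the formula above and the parameter values reported in the text; the function name `eb_estimate` is ours.

```python
def eb_estimate(d, v, mu, sigma2):
    """EB estimate of delta_i: d weighted by sigma2, the prior mean mu weighted by v."""
    return (d * sigma2 + mu * v) / (sigma2 + v)

# Study 4 of Table 8.3: effect size and squared standard error.
d4, v4 = 1.18, 0.373 ** 2

# Parameter values quoted in the text for the ML and the linear Bayes fits.
ml_ebe = eb_estimate(d4, v4, mu=0.1011, sigma2=0.0456)
linear_ebe = eb_estimate(d4, v4, mu=0.0777, sigma2=0.0125)

print(round(ml_ebe, 4), round(linear_ebe, 4))
```

Because $v_4$ is large compared with either estimate of $\sigma_G^2$, the observed effect size 1.18 is pulled strongly towards the estimated prior mean in both cases.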

Table 8.3 also gives values of a concomitant variable C = weeks of
prior contact. Raudenbush and Bryk coded c-values $\ge 3$ to 3 to give
$c_i$ as in Table 8.3. A quick examination suggests that there is an
association between D and C which might be taken into account in
developing EB estimates of individual $\delta_i$ values. From equations
(4.7.2) and (4.7.3) we obtain the estimates

$$\hat\gamma_0 = 0.407, \quad \hat\gamma_1 = -0.157, \quad \hat\tau^2 = 0.000.$$

According to Raudenbush and Bryk (1985) the EB estimate of $\delta_i$ can
be written as

$$\hat\delta_i = \{d_i\hat\tau^2 + (\hat\gamma_0 + \hat\gamma_1 c_i)v_i\}/(v_i + \hat\tau^2).$$

The linear Bayes method gives the following estimates:

$$\hat\gamma_0 = 0.418, \quad \hat\gamma_1 = -0.162, \quad \hat\tau^2 = 0.006.$$
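To see what the two sets of estimates imply for a single study, the EB formula with the concomitant variable can be evaluated directly. This is a sketch under the assumption that the formula above is applied as written; the function name is ours, and the numerical outputs are not taken from the source.

```python
def eb_with_covariate(d, v, c, gamma0, gamma1, tau2):
    """EB estimate of delta_i when the prior mean is gamma0 + gamma1 * c."""
    prior_mean = gamma0 + gamma1 * c
    return (d * tau2 + prior_mean * v) / (v + tau2)

# Study 4 of Table 8.3: d = 1.18, standard error 0.373, coded contact c = 0.
d4, v4, c4 = 1.18, 0.373 ** 2, 0

# With tau2 = 0 the estimate collapses onto the regression line.
rb_estimate = eb_with_covariate(d4, v4, c4, 0.407, -0.157, 0.000)
# With the linear Bayes values a small weight is returned to d itself.
lb_estimate = eb_with_covariate(d4, v4, c4, 0.418, -0.162, 0.006)

print(rb_estimate, lb_estimate)
```

When $\hat\tau^2 = 0$ every study with the same $c_i$ receives the same estimate, so the observed $d_i$ plays no part beyond its contribution to $\hat\gamma_0$ and $\hat\gamma_1$.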


8.3 Examples involving standard discrete distributions 

8.3.1 Bacteria in samples of drinking water 

Von Mises (1943) refers to a study of water quality in which n = 5
samples were taken from batches of water, interest being in whether a
sample contains at least one bacterium. Let $\theta$ be the probability of a
positive result, i.e. at least one bacterium present. For a given batch
the probability of x positive results is given by the binomial $(5, \theta)$
distribution. In repetitions of the procedure with successive batches
the marginal X distribution is the mixed binomial

$$p_G(x) = \int \binom{5}{x}\theta^x(1-\theta)^{5-x}\,dG(\theta).$$

Von Mises discusses estimation of the mixing distribution using a
sample of N = 3420 observations on the marginal distribution. The
actual problem considered by von Mises was not one of estimating $\theta$
but one of evaluating the decision rule according to which a batch of
water is accepted as having $\theta$ in the range 0 to $\theta_1$ (= 0.63) when x = 0
is observed. Thus interest centres on the conditional probability

$$P(0 < \theta \le \theta_1 \mid X = x) = \int_0^{\theta_1} \binom{5}{x}\theta^x(1-\theta)^{5-x}\,dG(\theta) \Big/ p_G(x), \qquad x = 0.$$

After first estimating the first two moments of $G(\theta)$ a lower bound for
$P(0 < \theta \le \theta_1 \mid X = 0)$ is obtained. A similar calculation is performed
for the case X = 1. In fact, this procedure is mathematically close to
the point estimation discussed in earlier chapters. In both cases an
estimate is made from previous data of the ratio of two integrals with
respect to $G(\theta)$ of functions of $\theta$ indexed by x.


8.3.2 True scores in psychological testing 

In a typical situation considered by Lord (1969) a subject is required
to answer m items in a test. On each item the score is 1 or 0 according
as the answer is correct or not, and the test score is the sum, X, of the
item scores. A simple model is that the probability of a correct answer
is $\theta$ on each item. Lord (1969) also considers the more realistic model
in which $\theta$ varies from item to item but we shall look only at an
example in which $\theta$ is constant for each subject. The probability $\theta$ is
also referred to as the true score of the subject. Of course, the EB
aspect of this problem is that $\theta$ varies from subject to subject. Under
these conditions the distribution of X for a given subject is
binomial $(m, \theta)$, and the marginal X distribution is a mixed binomial.

Cressie (1979) presents a set of data on N = 12 990 subjects on a test
with m = 20 and calculates EB estimates of $\theta$ on the assumption that
the marginal X distribution is a mixed binomial. Cressie proposes a
Table 8.4 Observed frequency distribution of scores and EB
estimates of true scores in a psychological test

Score    Frequency    Simple EB        Smooth EB
  x       N f(x)      \hat\theta(x)

 20          63         0.950            0.898
 19         141         0.879            0.863
 18         220         0.824            0.823
 17         319         0.774            0.773
 16         424         0.713            0.716
 15         622         0.672            0.663
 14         776         0.629            0.619
 13        1001         0.586            0.583
 12        1203         0.546            0.553
 11        1443         0.515            0.526
 10        1550         0.503            0.500
  9        1409         0.483            0.475
  8        1235         0.445            0.451
  7        1052         0.423            0.426
  6         696         0.407            0.402
  5         471         0.369            0.379
  4         226         0.362            0.356
  3          98         0.314            0.334
  2          27         0.283            0.314
  1          12         0.144            0.296
  0           2         0.050            0.280


method of obtaining an approximate simple EB estimate, i.e. an EB
estimate which can be calculated without explicitly estimating the
mixing distribution. Let $\hat\theta(x)$ be the EB estimate of the true score of a
subject whose observed score is x. Cressie gives the following formula
for $\hat\theta(x)$:

$$\hat\theta(x) = \begin{cases}
1/m, & x = 0,\\
x/m + (1 - 2x/m)/m + (x/m)(1 - x/m)\{f(x+1) - f(x-1)\}/\{2f(x)\}, & x = 1, 2, \ldots, m-1,\\
1 - 1/m, & x = m,
\end{cases}$$

where f(x) is the observed proportion of subjects with score x.

Table 8.4 gives a frequency distribution of observed scores, the
estimate $\hat\theta(x)$ and a smooth estimate obtained according to the
method of estimating the mixing distribution given in Lord (1969).
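The simple EB column of Table 8.4 can be recomputed from the frequency column alone. The sketch below assumes the piecewise formula with end cases x = 0 and x = m; the names `freq` and `simple_eb` are ours.

```python
# Frequencies N f(x) from Table 8.4, indexed by score x = 0, 1, ..., 20.
freq = [2, 12, 27, 98, 226, 471, 696, 1052, 1235, 1409, 1550,
        1443, 1203, 1001, 776, 622, 424, 319, 220, 141, 63]
m = len(freq) - 1  # number of test items

def simple_eb(x):
    """Cressie's approximate simple EB estimate of the true score at score x."""
    if x == 0:
        return 1 / m
    if x == m:
        return 1 - 1 / m
    p = x / m
    # The ratio of proportions equals the ratio of raw frequencies.
    ratio = (freq[x + 1] - freq[x - 1]) / (2 * freq[x])
    return p + (1 - 2 * p) / m + p * (1 - p) * ratio

estimates = [simple_eb(x) for x in range(m + 1)]
for x in (20, 19, 10, 1, 0):
    print(x, round(estimates[x], 3))
```

Only neighbouring frequencies enter each estimate, which is why no explicit estimate of the mixing distribution is needed; the price is that the simple estimates are less smooth than those in the last column of the table.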
8.3.3 Acceptance sampling

A common sampling inspection model supposes that batches of items
are inspected by sampling a relatively small number of items from
each batch, and on the basis of the quality of the sampled items a
decision is made about the quality of the batch. In the simplest cases a
batch is supposed to contain a proportion $\lambda$ of ‘defective’ items, the
rest being ‘good’. The decision becomes a matter of estimating the
value of $\lambda$, or of accepting or rejecting the hypothesis that $\lambda$ is smaller
than some critical value. More complicated procedures are also
considered, as we indicate below.

What makes a Bayesian approach to this problem quite natural is
that it seems reasonable to assume that $\lambda$ varies randomly from batch
to batch, according to a prior distribution function $G(\lambda)$. In this
context schemes in which the loss structure takes the cost of sampling
into account have been considered by many authors. Hald (1960)
takes the loss caused by an accepted defective item as 1, the cost per
item associated with rejected batches as $k_r$, and the cost of sampling as
$k_s$ per item. The cost of each rejected batch is $nk_s + (N - n)k_r$ and of
each accepted batch is $nk_s + (N\lambda - x)$, where x is the number of
defective items in a sample of n items drawn at random from the
batch of N items.

Let $p(x \mid \lambda)$ be the probability of getting x defectives in the sample
when the proportion of defectives in the batch is $\lambda$. Suppose the
decision rule is to accept batches when $x \le c$ and to reject when $x > c$.
Then the average cost for batches of quality $\lambda$ is

$$K(n, c, \lambda) = nk_s + \sum_{x=0}^{c} (N\lambda - x)\,p(x \mid \lambda) + (N - n)k_r \sum_{x=c+1}^{n} p(x \mid \lambda).$$

The overall average cost is $K(n, c) = \int_0^1 K(n, c, \lambda)\,dG(\lambda)$, where $G(\lambda)$ is
the prior distribution function of $\lambda$ expressing the batch to batch
random variation in $\lambda$. We can write K(n, c) as

$$K(n, c) = nk_s + \sum_{x=0}^{c} [N E(\lambda \mid x) - x]\,p_G(x) + (N - n)k_r \sum_{x=c+1}^{n} p_G(x),$$

where $p_G(x) = \int p(x \mid \lambda)\,dG(\lambda)$, as before.
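The equivalence of the two expressions for K(n, c) is easy to confirm numerically. The sketch below is illustrative only: it assumes binomial sampling for $p(x \mid \lambda)$ and a hypothetical three-point prior, and the batch size and cost constants are invented for the example.

```python
from math import comb

def binom_pmf(x, n, lam):
    """Binomial probability of x defectives in n items (assumed form of p(x|lambda))."""
    return comb(n, x) * lam ** x * (1 - lam) ** (n - x)

def cost_given_quality(n, c, lam, N, ks, kr):
    """K(n, c, lambda): sampling cost plus accepted and rejected batch costs."""
    accept = sum((N * lam - x) * binom_pmf(x, n, lam) for x in range(c + 1))
    reject = (N - n) * kr * sum(binom_pmf(x, n, lam) for x in range(c + 1, n + 1))
    return n * ks + accept + reject

def average_cost(n, c, prior, N, ks, kr):
    """K(n, c) as the prior average of K(n, c, lambda); prior is [(lambda, weight)]."""
    return sum(w * cost_given_quality(n, c, lam, N, ks, kr) for lam, w in prior)

def average_cost_marginal(n, c, prior, N, ks, kr):
    """K(n, c) written in terms of p_G(x) and E(lambda | x)."""
    total = n * ks
    for x in range(n + 1):
        joint = [(lam, w * binom_pmf(x, n, lam)) for lam, w in prior]
        p_g = sum(j for _, j in joint)
        if x <= c:
            post_mean = sum(lam * j for lam, j in joint) / p_g
            total += (N * post_mean - x) * p_g
        else:
            total += (N - n) * kr * p_g
    return total

# Hypothetical prior and costs, chosen only to exercise the formulae.
prior = [(0.01, 0.6), (0.05, 0.3), (0.15, 0.1)]
N, ks, kr = 1000, 0.05, 0.04

k_direct = average_cost(20, 1, prior, N, ks, kr)
k_marginal = average_cost_marginal(20, 1, prior, N, ks, kr)

# Crude grid search for a good (n, c) combination.
best_cost, best_n, best_c = min(
    (average_cost(n, c, prior, N, ks, kr), n, c)
    for n in range(1, 41) for c in range(n + 1)
)
print(k_direct, k_marginal, best_n, best_c)
```

The second form is the one relevant to EB work, since it involves the prior only through $p_G(x)$ and $E(\lambda \mid x)$, both of which can be estimated from past samples.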

Usually an optimum combination of n and c can be found, i.e. such
as to minimize K(n, c). This, and similar mathematical problems, have
been treated in detail by Hald (1960, 1967), Wetherill and Campling
(1966) and others. Hald (1960) refers to an example of Kjaer (1957)
giving an empirical prior distribution obtained after full inspection of
a number of batches. Such a distribution can be used instead of the
true $G(\lambda)$ to devise an empirical Bayes sampling inspection plan. More
commonly full inspection will not be carried out, but observations
providing an estimate of $p_G(x)$ may be available, thus providing the
data for application of the EB methods that have been developed. For
a comment on this approach see Bohrer (1966).

A somewhat simpler example is treated by Cressie and Seheult
(1985) who consider estimation of the number of defectives, d, in a
batch, when a sample of size m is taken without replacement from a
batch of size M; in our notation $d = M\lambda$. The distribution of X given d
is hypergeometric. A simple EB approach is possible through the
relation

$$\theta(x) = E\left[\frac{D - x}{M - m + 1 - D + x} \,\Big|\, X = x\right] = \frac{(x+1)\,p_G(x+1)}{(m-x)\,p_G(x)}, \qquad x = 0, 1, \ldots, m-1,$$


where $p_G(x)$ is the marginal X probability function, as in earlier
chapters. Cressie and Seheult use an approximation for $E(D \mid X = x)$ of
the form

$$x + (M - m)\,\frac{\theta(x)}{1 + \theta(x)}\left[1 - \frac{\theta(x) - \theta(x-1)}{\theta(x)\{1 + \theta(x-1)\} + \theta(x) + 1}\right].$$


In the numerical example given by Cressie and Seheult the
hypergeometric distribution is assumed to be adequately approximated by a
binomial distribution, the probabilities $p_G(x)$ are estimated directly
by observed frequency ratios, and the estimate of $\theta(x)$ is monotonized
according to the method of van Houwelingen (1977). The data
reported by Cressie and Seheult are shown in Table 8.5. The coding of
answers in question forms used in a household survey was checked by
random selection of m = 42 forms from each of 91 batches of forms,
the observed X being the number of forms with errors in one
particular question. The batch sizes varied as shown in Table 8.5.

A linear EB approach to this example is possible in which one can
use the formulae

$$E(X \mid D) = mD/M,$$

$$E(X^2 \mid D) = m(M - m)D/\{M(M - 1)\} + [(m/M)^2 - m(M - m)/\{M^2(M - 1)\}]D^2.$$
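Both hypergeometric facts used in this subsection can be confirmed numerically for a single batch. In the sketch below the batch size and sample size are those of batch 1 in Table 8.5, while the number of defectives D = 9 is hypothetical; the ratio identity is written with numerator D - x, which is the form in which it holds for each fixed D.

```python
from math import comb

def hyper_pmf(x, D, M, m):
    """P(X = x) when m items are drawn without replacement from M, D being defective."""
    if x < 0 or x > min(m, D) or m - x > M - D:
        return 0.0
    return comb(D, x) * comb(M - D, m - x) / comb(M, m)

M, m = 238, 42   # batch size and sample size (batch 1 of Table 8.5)
D = 9            # hypothetical number of defectives in the batch

# Largest discrepancy in the ratio identity over x = 0, ..., D - 1.
ratio_gap = max(
    abs((x + 1) * hyper_pmf(x + 1, D, M, m) / ((m - x) * hyper_pmf(x, D, M, m))
        - (D - x) / (M - m + 1 - D + x))
    for x in range(D)
)

# First two moments by direct summation against the closed forms.
mean_x = sum(x * hyper_pmf(x, D, M, m) for x in range(m + 1))
second_x = sum(x * x * hyper_pmf(x, D, M, m) for x in range(m + 1))
mean_formula = m * D / M
second_formula = (m * (M - m) * D / (M * (M - 1))
                  + ((m / M) ** 2 - m * (M - m) / (M ** 2 * (M - 1))) * D ** 2)

print(ratio_gap, mean_x - mean_formula, second_x - second_formula)
```

Averaging the per-batch ratio over the posterior distribution of D given X = x gives the simple EB relation, and the two moment formulae are what a linear EB estimate of D requires.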
Table 8.5 Batch number i, batch size M_i, observed number of defectives x_i in
samples of size m = 42, and nearest integer to the monotonized simple EBE

  i   M_i  x_i  EBE  |   i   M_i  x_i  EBE  |   i   M_i  x_i  EBE

  1   238   0    1   |  31   201   0    1   |  61   193   0    1
  2   199   0    1   |  32   140   0    1   |  62   233   1    3
  3   216   0    1   |  33   232   1    3   |  63   229   1    3
  4   145   0    1   |  34   141   1    2   |  64   171   0    1
  5   248   3   14   |  35   208   0    1   |  65   192   0    1
  6   228   1    3   |  36   215   1    3   |  66   186   0    1
  7   215   0    1   |  37   143   0    1   |  67   212   0    1
  8   254   0    2   |  38   180   0    1   |  68   170   0    1
  9   210   0    1   |  39   240   0    1   |  69   224   0    1
 10   185   0    1   |  40   223   0    1   |  70   256   0    2
 11   165   2    7   |  41   258   1    3   |  71   207   2    9
 12   221   0    1   |  42   175   3   10   |  72   141   1    2
 13   257   1    3   |  43   250   3   14   |  73   258   0    2
 14   241   4   19   |  44   257   0    2   |  74   249   0    1
 15   198   0    1   |  45   226   0    1   |  75   222   0    1
 16   190   0    1   |  46   243   0    1   |  76   196   0    1
 17   160   0    1   |  47   259   0    2   |  77   254   3   15
 18   150   0    1   |  48   249   0    1   |  78   176   3   10
 19   239   0    1   |  49   249   0    1   |  79   256   4   20
 20   212   1    3   |  50   204   0    1   |  80   156   1    2
 21   226   0    1   |  51   248   2   11   |  81   253   0    2
 22   194   3   11   |  52   195   1    3   |  82   142   1    2
 23   244   0    1   |  53   175   0    1   |  83   257   0    2
 24   226   0    1   |  54   147   0    1   |  84   190   0    1
 25   166   0    1   |  55   245   0    1   |  85   238   0    1
 26   220   0    1   |  56   180   5   16   |  86   237   1    3
 27   174   0    1   |  57   187   0    1   |  87   228   4   18
 28   168   0    1   |  58   256   1    3   |  88   185   1    3
 29   159   0    1   |  59   189   1    3   |  89   216   0    1
 30   186   0    1   |  60   228   1    3   |  90   145   2    7
                     |                      |  91   182   0    1


8.3.4 Oilwell discoveries 

Table 8.6 gives data on oilwell discoveries used by Clevenson and 
Zidek (1975) in a study of simultaneous estimation of the means of 
independent Poisson laws. The numbers of discoveries, $X_i$, in certain
months are shown. They are taken to be observations on Poisson
random variables with means $\lambda_i$. From additional data more accurate
estimates of the $\lambda$ values are available. These are also shown in the
Table 8.6 Observed monthly numbers of oilwell discoveries,
X_i, and corresponding expected values, lambda_i

  i   X_i  lambda_i  |   i   X_i  lambda_i

  1    0     1.17    |  19    1     0.50
  2    0     0.83    |  20    2     0.50
  3    0     0.50    |  21    0     1.33
  4    1     1.00    |  22    0     0.83
  5    2     0.83    |  23    0     0.33
  6    1     0.83    |  24    1     1.50
  7    0     1.17    |  25    5     1.33
  8    2     0.83    |  26    0     0.67
  9    0     0.67    |  27    1     0.67
 10    0     0.17    |  28    0     0.33
 11    0     0.00    |  29    0     0.33
 12    1     0.33    |  30    1     0.33
 13    3     1.50    |  31    0     0.50
 14    0     0.50    |  32    1     0.83
 15    0     1.17    |  33    0     0.67
 16    3     1.33    |  34    1     0.33
 17    0     0.50    |  35    0     0.00
 18    2     1.17    |  36    1     0.50


table, and for the purpose of this exercise are assumed to be the true $\lambda$
values.

The simultaneous estimator of $(\lambda_1, \lambda_2, \ldots, \lambda_n)$ derived by Clevenson
and Zidek is

$$\{Z/(Z + \beta + n - 1)\}(X_1, X_2, \ldots, X_n),$$

where $Z = X_1 + X_2 + \cdots + X_n$. Their approach is one of compound
estimation in which the loss when estimating $\lambda$ by $(\delta_1, \delta_2, \ldots, \delta_n)$ is

$$\sum_{i=1}^{n} (\delta_i - \lambda_i)^2/\lambda_i.$$

A straightforward Bayes approach to the problem is to use the
gamma prior density $g(\lambda) = \{1/\Gamma(\beta)\}\alpha^{\beta}\lambda^{\beta-1}\exp(-\alpha\lambda)$. Then the loss
structure $(\delta - \lambda)^2/\lambda$ leads to the Bayes estimate

$$\delta_G(x) = (\beta + x - 1)/(\alpha + 1).$$

If we put $\beta = 1$ and note that the mean of the marginal X distribution
is then $E(X_G) = \int \alpha\lambda \exp(-\alpha\lambda)\,d\lambda = 1/\alpha$, a natural estimate of $1/\alpha$ from
past data is $Z/n$. This gives the empirical Bayes estimate

$$\hat\delta_G(x) = \{Z/(Z + n)\}x,$$

which is exactly of the same form as the compound estimate with
$\beta = 1$. This is the version of the estimate used by Clevenson and
Zidek in the illustration with the oilwell data.

For the data in Table 8.6 we have Z = 29 with n = 36, so the EB
estimate is $\hat\delta_G(x) = (29/65)x$. Also, $\sum (X_i - \lambda_i)^2/\lambda_i = 35.12$, and
$\sum \{\hat\delta_G(X_i) - \lambda_i\}^2/\lambda_i = 14.26$, showing substantial improvement of the
estimation by using EB rather than ML.
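The comparison can be repeated from the values printed in Table 8.6. The sketch below makes one assumption that the text does not state: the two months with $\lambda_i = 0$ are left out of the loss, since their terms are undefined. Because of this, and because the tabulated $\lambda_i$ are rounded, the totals need not agree exactly with those quoted.

```python
# X_i and lambda_i as printed in Table 8.6, months 1 to 36.
x = [0, 0, 0, 1, 2, 1, 0, 2, 0, 0, 0, 1, 3, 0, 0, 3, 0, 2,
     1, 2, 0, 0, 0, 1, 5, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1]
lam = [1.17, 0.83, 0.50, 1.00, 0.83, 0.83, 1.17, 0.83, 0.67,
       0.17, 0.00, 0.33, 1.50, 0.50, 1.17, 1.33, 0.50, 1.17,
       0.50, 0.50, 1.33, 0.83, 0.33, 1.50, 1.33, 0.67, 0.67,
       0.33, 0.33, 0.33, 0.50, 0.83, 0.67, 0.33, 0.00, 0.50]

n = len(x)
Z = sum(x)
shrink = Z / (Z + n)  # EB factor with beta = 1

def weighted_loss(estimates):
    """Sum of (estimate - lambda)^2 / lambda, skipping months with lambda = 0."""
    return sum((e - l) ** 2 / l for e, l in zip(estimates, lam) if l > 0)

ml_loss = weighted_loss(x)                          # ML estimate is X_i itself
eb_loss = weighted_loss([shrink * xi for xi in x])  # EB estimate is shrink * X_i

print(Z, n, shrink, ml_loss, eb_loss)
```

Most of the gain comes from the few months with large counts, where multiplying by 29/65 brings the estimate much closer to the corresponding $\lambda_i$.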


8.3.5 Cancer mortality rates 

Table 8.7 is extracted from Tsutakawa, Shoop and Marienfeld (1985).
It shows an estimate of the mid-period number of persons at risk, $m_i$,
and the number of stomach cancer deaths, $y_i$, in the i = 1, 2, ..., 84
largest cities in Missouri for the period 1972-81. The data are for
males aged 45-64 years.

It is assumed that $y_i$ is a realization of a Poisson random variable $Y_i$


Table 8.7 Stomach cancer mortality in Missouri cities, males aged 45-64 years
in 1972-1981

   m_i   y_i  |   m_i   y_i  |   m_i   y_i  |   m_i   y_i

 98066    99  |  1185     0  |   647     1  |   443     0
 53637    54  |  1104     0  |   631     1  |   423     1
 46394    80  |  1083     0  |   603     0  |   419     2
 12890    17  |  1025     1  |   601     0  |   403     0
 10975    11  |   917     1  |   600     1  |   395     0
  7436    13  |   917     0  |   592     0  |   395     0
  3814     3  |   877     0  |   588     3  |   389     1
  3461     2  |   874     0  |   583     1  |   386     0
  3349     1  |   857     1  |   582     3  |   383     0
  3215     1  |   855     0  |   582     0  |   372     1
  2708     2  |   854     0  |   581     1  |   368     1
  2530     4  |   842     0  |   556     0  |   350     1
  2145     4  |   799     2  |   527     0  |   339     1
  1823     2  |   731     5  |   524     1  |   333     1
  1668     2  |   721     0  |   522     0  |   325     0
  1627     3  |   709     1  |   517     1  |   318     0
  1407     1  |   706     3  |   512     1  |   317     0
  1356     0  |   680     1  |   493     0  |   312     1
  1339     3  |   676     1  |   490     1  |   307     0
  1209     1  |   664     0  |   481     0  |   305     0
  1208     1  |   657     0  |   448     1  |   305     0
with mean $m_i\lambda_i$. The $Y_i$ are independent of each other, and the $\lambda_i$
values are regarded as being sampled at random from a prior
distribution G. In the analysis given by Tsutakawa et al. (1985) the
reparametrization $\alpha_i = \ln\{\lambda_i/(1 - \lambda_i)\}$ is used, and the distribution
from which the $\alpha_i$ are sampled is taken as $N(\mu, \sigma^2)$. Under these
assumptions the marginal probability of $y_i$ is

$$p_G(y_i) = \int \exp(-m_i\lambda_i)(m_i\lambda_i)^{y_i}(1/y_i!)(1/\sigma)\phi\{(\alpha_i - \mu)/\sigma\}\,d\alpha_i,$$

where $\phi(\cdot)$ is the standard normal density, and $\lambda_i = 1/\{1 + \exp(-\alpha_i)\}$.
This $p_G(y_i)$ depends on the two parameters $\mu$ and $\sigma$ which can be
estimated by the ML method, i.e. by maximizing the marginal
likelihood $p_G(y_1)p_G(y_2)\cdots p_G(y_n)$. The posterior mean of $\lambda_i$ is

$$E(\lambda_i \mid y_i) = \int \{1 + \exp(-\alpha_i)\}^{-1} h(\alpha_i \mid y_i, \mu, \sigma)\,d\alpha_i,$$

where $h(\alpha_i \mid y_i, \mu, \sigma)$ is the posterior density of $\alpha_i$, written down in an
obvious way from the given assumptions. Replacing $\mu$ and $\sigma$ by their
ML estimates $\hat\mu$ and $\hat\sigma$ gives the EB estimate $\hat\lambda_{iEB}$; the usual MLE of $\lambda_i$
is $\hat\lambda_i = y_i/m_i$.

For city i the crude mortality rate is calculated as $r_i = \hat\lambda_i \times 10^5$, and
the corresponding EB rate is $r_{iEB}$ obtained by replacing $\hat\lambda_i$ by $\hat\lambda_{iEB}$.
Table 8.8 gives crude and EB rates for a selection of the cities listed in
Table 8.7. The large fluctuations exhibited by the crude rates are
smoothed out by the EB approach.
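The posterior mean involves a one-dimensional integral that is readily approximated on a grid. The sketch below is illustrative only: the values of $\mu$ and $\sigma$ are hypothetical and are not the ML estimates of Tsutakawa et al., and a simple equally spaced Riemann sum is used in place of whatever quadrature the original analysis employed.

```python
from math import exp, lgamma, log

def posterior_mean_rate(y, m, mu, sigma, grid=4001, width=8.0):
    """E(lambda | y) when logit(lambda) ~ N(mu, sigma^2) and Y ~ Poisson(m * lambda)."""
    lo = mu - width * sigma
    step = 2 * width * sigma / (grid - 1)
    log_weights, rates = [], []
    for k in range(grid):
        a = lo + k * step
        lam = 1.0 / (1.0 + exp(-a))
        log_lik = -m * lam + y * log(m * lam) - lgamma(y + 1)
        log_prior = -0.5 * ((a - mu) / sigma) ** 2
        log_weights.append(log_lik + log_prior)
        rates.append(lam)
    top = max(log_weights)  # subtract the maximum to avoid underflow
    weights = [exp(v - top) for v in log_weights]
    return sum(w * r for w, r in zip(weights, rates)) / sum(weights)

mu, sigma = -6.7, 0.3  # hypothetical values on the logit scale

rate_zero = posterior_mean_rate(0, 664, mu, sigma)  # city with m = 664, y = 0
rate_five = posterior_mean_rate(5, 731, mu, sigma)  # city with m = 731, y = 5
print(rate_zero * 1e5, rate_five * 1e5)
```

Working with logarithms of the integrand keeps the calculation stable for the largest cities, where $(m_i\lambda_i)^{y_i}$ would otherwise overflow. The two cities chosen have crude rates of 0 and about 684 per $10^5$, and both are drawn towards the centre of the prior.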

An alternative EB approach is the linear EB method as given in
sections 3.8.3 and 3.9.1. The linear Bayes estimate of $\lambda_i$ is

$$\tilde\lambda_i = \omega_0 + \omega_1 y_i/m_i,$$


Table 8.8 Crude and EB mortality rates for selected
Missouri cities

   m_i   y_i     r_i    r_iEB   r_iLEB

 98066    99     ...    102.8    118.3
  3215     1    31.1     ...     117.5
   731     5   684.4    140.3    125.6
   664     0     0.0    110.8    117.1
   333     1   300.1    118.6     ...
where $\omega_0$ and $\omega_1$ are given by

$$\begin{bmatrix} 1 & E(\Lambda) \\ E(\Lambda) & E(\Lambda^2) + E(\Lambda)/m_i \end{bmatrix}
\begin{bmatrix} \omega_0 \\ \omega_1 \end{bmatrix} =
\begin{bmatrix} E(\Lambda) \\ E(\Lambda^2) \end{bmatrix}.$$

In order to obtain an empirical version of $\tilde\lambda_i$ we need estimates of
$\gamma_1 = E(\Lambda)$ and $\gamma_2 = E(\Lambda^2)$. Possible estimates are

$$\hat\gamma_1 = (1/n)\sum_{i=1}^{n} y_i/m_i,$$

$$\hat\gamma_2 = (1/n)\sum_{i=1}^{n} y_i(y_i - 1)/m_i^2.$$

Note that slightly different forms of estimators are given in
section 3.9.1. Applying this method to the data in Table 8.7 gives
$\hat\gamma_1 = 0.00119$ and $\hat\gamma_2 = 1.448 \times 10^{-6}$, and the linear EB estimates of rates
given by these values are shown in Table 8.8 under the heading
$r_{iLEB}$.
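The two moment estimates can be recomputed from Table 8.7. In the sketch below the table is read column by column, and the weights are obtained by solving the 2 x 2 system in closed form; the function `linear_bayes_weights` is ours, and no attempt is made to reproduce the $r_{iLEB}$ column, which may rest on the slightly different estimators of section 3.9.1.

```python
# (m_i, y_i) pairs from Table 8.7, read down each column in turn.
cities = [
    (98066, 99), (53637, 54), (46394, 80), (12890, 17), (10975, 11), (7436, 13), (3814, 3),
    (3461, 2), (3349, 1), (3215, 1), (2708, 2), (2530, 4), (2145, 4), (1823, 2),
    (1668, 2), (1627, 3), (1407, 1), (1356, 0), (1339, 3), (1209, 1), (1208, 1),
    (1185, 0), (1104, 0), (1083, 0), (1025, 1), (917, 1), (917, 0), (877, 0),
    (874, 0), (857, 1), (855, 0), (854, 0), (842, 0), (799, 2), (731, 5),
    (721, 0), (709, 1), (706, 3), (680, 1), (676, 1), (664, 0), (657, 0),
    (647, 1), (631, 1), (603, 0), (601, 0), (600, 1), (592, 0), (588, 3),
    (583, 1), (582, 3), (582, 0), (581, 1), (556, 0), (527, 0), (524, 1),
    (522, 0), (517, 1), (512, 1), (493, 0), (490, 1), (481, 0), (448, 1),
    (443, 0), (423, 1), (419, 2), (403, 0), (395, 0), (395, 0), (389, 1),
    (386, 0), (383, 0), (372, 1), (368, 1), (350, 1), (339, 1), (333, 1),
    (325, 0), (318, 0), (317, 0), (312, 1), (307, 0), (305, 0), (305, 0),
]
n = len(cities)

gamma1 = sum(y / m for m, y in cities) / n
gamma2 = sum(y * (y - 1) / m ** 2 for m, y in cities) / n

def linear_bayes_weights(m_i, g1, g2):
    """Solve the normal equations for (omega_0, omega_1) at sample size m_i."""
    prior_var = g2 - g1 ** 2
    w1 = prior_var / (prior_var + g1 / m_i)
    w0 = g1 * (1 - w1)
    return w0, w1

w0, w1 = linear_bayes_weights(731, gamma1, gamma2)
print(gamma1, gamma2, w0, w1)
```

The weight $\omega_1$ increases with $m_i$, so the crude rate of a large city is altered less than that of a small one.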

8.3.6 A quality measurement plan (QMP) 

Suppose that a quality control scheme is operated in which numbers
of defects, $x_i$, are observed in audit samples at rating periods
i = 1, 2, ..., n. A traditional method of statistical quality control is by
plotting a Shewhart-type chart, in this case a T-rate chart of values of
$T_i = (e_i - x_i)/\sqrt{e_i}$, i = 1, 2, ..., n plotted in sequence. In this definition
of $T_i$ the symbol $e_i$ is the expected rate according to a set standard. In
an attempt to develop an even more useful technique for quality
control and measurement Hoadley (1981a, b) proposed an EB
approach to the problem, actually what is now known as a Bayes-EB
approach, following the FB approach discussed in section 7.2.2.

Let $m_i$ be the audit sample size at the ith period. The observation $x_i$
is taken to be a realization of a r.v. $X_i$ whose mean is $m_i a_i$, where $a_i$ is
the true defect rate at the ith period. The rate $a_i$ is regarded as a
realization of a random process. The model proposed by Hoadley is
specified as follows:

1. The distribution of $X_i$ is Poisson with mean $s m_i \lambda_i$, noting the
reparametrization $\lambda_i = a_i/s$, where s is a standard defects rate
assumed known and fixed in advance. The sequence $\{\lambda_i\}$,
i = 1, 2, ..., n, is assumed to comprise independent realizations of a
r.v. $\Lambda$ whose distribution is gamma with mean $\phi_1$ and variance $\phi_2$.

2. The data set available for estimating the $\lambda_i$'s is the sequence
$\{I_i; i = 1, 2, \ldots, n\}$, where $I_i = x_i/e_i$ is an unbiased estimate of $\lambda_i$ at
the ith rating period. The ratio $I_i$ is also called the quality index.

3. In a Bayes approach to the problem the parameters $\phi_1$ and $\phi_2$ are
assumed to have a proper prior distribution function $P(\phi)$ defined
such that the marginal distribution of $\phi_1$ is gamma with mean and
variance determined from the distribution of quality between
different product types. The marginal distribution of
$\omega = \sigma^2/(\sigma^2 + \phi_2)$ is gamma with mean and variance again depending
on factors such as distribution of quality between product
types. The parameter $\sigma^2$ is the average sampling variance over the
time interval covering n rating periods, and it is estimated
independently as discussed below.

The exact posterior distribution of the current defect rate, given the
observed number of defectives, can in principle be obtained by
substituting appropriately in formula (7.2.1). The distribution $G(\lambda \mid \phi)$ is
taken as gamma with scale parameter $\phi_2/\phi_1$ and shape parameter
$\phi_1^2/\phi_2$ and a proper prior $P(\phi)$ has to be used for $\phi$. Two difficulties
generally arise in implementing these ideas. One is that the choice of
P is not obvious, the other is that rather complicated numerical
integration may be required. These difficulties have led to the
approximate methods mentioned in Chapter 7 to obtain the posterior
distribution of $\lambda$ and its percentiles. The approximate solution given
by Hoadley (1981a, b) can be summarized as follows:

1. Calculate the ‘Bayes’ estimate, $\hat\phi_1$, as the mean of the conditional
distribution of the process mean, $\phi_1$, given $I_i$, i = 1, 2, ..., n. This
estimate is a weighted average of $I_i$ values obtained by using the
form of marginal prior distribution assumed above.

2. Calculate the ‘Bayes’ estimate, $\hat\omega$, of $\omega$ as the mean of the
conditional distribution of $\omega$ using the form of marginal prior
assumed above. It is of the form

$$\hat\omega = \hat\sigma^2/D(S),$$

where D is a known function of $S = \nu\sum_{i=1}^{n} q_i(I_i - \hat\phi_1)^2$, and $\nu$ and
$q_i$, i = 1, 2, ..., n, are known constants.

3. At the ith period the sampling variance of $I_i$ is $\lambda_i/e_i$. The
average sampling variance over all rating periods can then be
estimated as $\hat\sigma^2 = \sum_{i=1}^{n} q_i(I_i/e_i)$, where the $q_i$ are weights
determined so as to obtain an optimal linear estimate.
Table 8.9 Quality inspection data

1977 period                  3      4      5      6      7      8
Rating period (i)            1      2      3      4      5      6
Number inspected (n_i)     500    500    500    500    500    500
Number of defects (x_i)     17     20     19     12      7     11
Quality index (I_i)        2.4    2.9    2.7    1.7    1.0    1.6


4. Calculate the estimate $\hat\phi_2$ of the process variance $\phi_2$ from the relation
$\hat\omega = \hat\sigma^2/(\hat\sigma^2 + \hat\phi_2)$.

5. Calculate the estimate of the current shrinkage factor $w_k$, where
k = n + 1, as

$$\hat w_k = (\hat\phi_1/e_k)/\{(\hat\phi_1/e_k) + \hat\phi_2\}.$$

6. Calculate the EB estimate of the current quality $\lambda_k$ as

$$\hat\lambda_k = \hat w_k\hat\phi_1 + (1 - \hat w_k)I_k.$$

Hoadley (1986) gives a summary of the earlier work and illustrates the
QMP technique on the small data set reproduced in Table 8.9.

To obtain a QMP plot one needs to know, for a selected ‘current’
rating period k, say, the posterior mean and variance of $\lambda_k$. In the
following we illustrate the calculation of the posterior mean at k = 6.
The equality of the sample sizes $m_i$ simplifies formulae by making all
weights $p_i$ equal to each other; the same applies to the $q_i$ weights.
The process average is estimated as

$$\hat\phi_1 = \sum_{i=1}^{6} I_i/6 = 2.0.$$

The average sampling variance is estimated as

$$\hat\sigma^2 = \sum_{i=1}^{6} (I_i/e_i)/6,$$

and since the standard defects per unit is s = 0.014, we get $\hat\sigma^2 = 0.28$.

For this example Hoadley (1987) used an approximate solution to
obtain an estimate of the process variance $\phi_2$ as the difference
between the estimated ‘total variance’ and the estimated average
sampling variance. The total variance is estimated by

$$S = \sum_{i=1}^{6} (I_i - \hat\phi_1)^2/5 = 0.54.$$

Thus the process variance estimate is 0.26.
The posterior mean of $\lambda_6$ is calculated as follows:

$$\hat w_6 = (\hat\phi_1/e_6)/\{(\hat\phi_1/e_6) + \hat\phi_2\} = 0.28/0.54 = 0.52;$$

$$\hat\lambda_6 = (0.52)(2.0) + (0.48)(1.6) = 1.8.$$

Hoadley (1987) goes on to estimate the posterior variance of $\lambda_6$ as
$\mathrm{var}(\lambda_6 \mid I_1, I_2, \ldots, I_6) \approx 0.15$. The shape and scale parameters of the
posterior gamma distribution are estimated as

$$\hat\alpha = \hat\lambda_6^2/\mathrm{var}(\lambda_6 \mid I_1, I_2, \ldots, I_6) = 21.6,$$

$$\hat\beta = \mathrm{var}(\lambda_6 \mid I_1, I_2, \ldots, I_6)/\hat\lambda_6.$$

The 1st, 5th, 95th, 99th percentiles are then obtained respectively as
1.01, 1.2, 2.5, 2.8, and it is noted that the posterior 1st percentile is
greater than the standard set at 1.0.

By contrast the corresponding T-rate is

$$T_6 = (e_6 - x_6)/\sqrt{e_6} = (7 - 11)/\sqrt{7} = -1.5,$$

which is within the control limits (-2.0, +2.0).
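The arithmetic of the illustration can be retraced from Table 8.9. The sketch below follows the simplified calculation with equal weights; it works with unrounded quality indices, so intermediate values differ slightly from the rounded figures in the text, and the divisor n - 1 in the total variance is our reading of how the value 0.54 arises.

```python
from math import sqrt

s = 0.014                         # standard defects per unit
inspected = [500] * 6             # number inspected, Table 8.9
defects = [17, 20, 19, 12, 7, 11]

expected = [s * n_i for n_i in inspected]               # e_i
index = [x / e for x, e in zip(defects, expected)]      # quality index I_i
n = len(index)

phi1 = sum(index) / n                                   # process average
sampling_var = sum(i_val / e for i_val, e in zip(index, expected)) / n
total_var = sum((i_val - phi1) ** 2 for i_val in index) / (n - 1)
phi2 = total_var - sampling_var                         # process variance

k = n - 1                                               # current period, i = 6
w = (phi1 / expected[k]) / ((phi1 / expected[k]) + phi2)
post_mean = w * phi1 + (1 - w) * index[k]
t_rate = (expected[k] - defects[k]) / sqrt(expected[k])

print(round(phi1, 2), round(sampling_var, 2), round(total_var, 2),
      round(w, 2), round(post_mean, 2), round(t_rate, 2))
```

The contrast drawn in the text is visible here: the T-rate for period 6 uses only the current count, whereas the posterior mean also draws on the five earlier periods, in all but one of which the index exceeded the standard of 1.0.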

It will be noted that the illustration above is very similar to a linear
EB approach for the Poisson case as far as point estimation is
concerned. More complicated assumptions about the priors are
needed to derive the posterior distribution. These assumptions, and
the availability of suitable data for estimating hyper-prior
parameters, are crucial for the results to be reliable.


8.4 Miscellaneous EB applications 

8.4.1 Calibration 

A typical rather simple calibration problem arises in the following
way: accurate measurements $x_1, x_2, \ldots, x_n$ are made of a certain
characteristic of n objects. A quick, less accurate method of
measurement gives results $y_1, y_2, \ldots, y_n$. These results are used to estimate $\alpha$
and $\beta$ in the relation

$$y_i = \alpha + \beta x_i + \epsilon_i,$$

where the $\epsilon_i$ are independent identically distributed errors. Then the
estimates $\hat\alpha$ and $\hat\beta$ are used to estimate the true value, x, of a new object
which yields the quick result y. An obvious estimate of x is $(y - \hat\alpha)/\hat\beta$.

Of course, if $\alpha$ and $\beta$ are known the natural estimate is $(y - \alpha)/\beta$.

If x is regarded as a realization of a random variable X with
distribution function G(x), a Bayes estimate of x can be developed.
Suppose for the moment that $\alpha$ and $\beta$ are known, and to simplify
calculations, that the distribution of $\epsilon$ is $N(0, \sigma^2)$ and that G(x) is
$N(\mu_G, \sigma_G^2)$. Then straightforward application of results of the bivariate
normal distribution gives the Bayes estimate of x:

$$E(X \mid y) = \frac{\beta^2\sigma_G^2\{(y - \alpha)/\beta\} + \sigma^2\mu_G}{\beta^2\sigma_G^2 + \sigma^2}.$$

In some calibration experiments the initial set of x values is controlled
and so does not provide any information about G(x). On the other
hand, if those initial x values are sampled at random, like x, then they
can be used to obtain estimates of $\mu_G$ and $\sigma_G^2$. Replacing these two
parameters by their estimates gives an empirical Bayes estimate of x.
Usually $\alpha$ and $\beta$ will not be known and may be replaced by estimates $\hat\alpha$
and $\hat\beta$ derived by the method of least squares, or some other suitable
method. Suppose that $S_{xx}$, $S_{yy}$, $S_{xy}$ are the sample variances of X and
Y, and the sample covariance of X and Y, while $\bar x$ and $\bar y$ are the sample
means. Then, if the least squares estimates are used for $\alpha$ and $\beta$, and
noting that $\sigma_G^2$ is estimated by $S_{xx}$, and $\beta^2\sigma_G^2 + \sigma^2$ by $S_{yy}$, the empirical
version of the Bayes estimate of x given above is

$$\hat E(X \mid y) = \frac{S_{xy}}{S_{yy}}(y - \bar y) + \bar x, \qquad (8.4.1)$$

also known as the ‘inverse estimator’.

A somewhat more general approach is presented in Lwin and
Maritz (1980) where the connection between y and x is written as

$$y = m(x, \theta) + \epsilon,$$

and the conditional distribution function of Y given X = x is taken to
be $F[\{y - m(x, \theta)\}/\sigma]$, with density f. Then, by the same argument as above, the
Bayes estimate of x is

$$E(X \mid y) = \int x f[\{y - m(x, \theta)\}/\sigma]\,dG(x) \Big/ \int f[\{y - m(x, \theta)\}/\sigma]\,dG(x).$$

Also, if $\hat\theta$ and $\hat\sigma$ are estimates of $\theta$ and $\sigma$, and if the observed x values
are sampled at random like the current x to be estimated, an empirical
version of the Bayes estimate is

$$\hat x(y) = \sum_{i=1}^{n} x_i f[\{y - m(x_i, \hat\theta)\}/\hat\sigma] \Big/ \sum_{i=1}^{n} f[\{y - m(x_i, \hat\theta)\}/\hat\sigma].$$

It is interesting to note that this estimate is not the same as the
estimate given by (8.4.1) even when $m(x, \theta) = \alpha + \beta x$ and F is the
standard normal distribution function. It is not linear in y, unlike
both the inverse and the ‘classical’ estimate $(y - \hat\alpha)/\hat\beta$, where $\hat\alpha$ and $\hat\beta$
are the least squares estimates of $\alpha$ and $\beta$.

An illustration of the application of these estimates to data on the
water content of soil specimens reported by Aitchison and Dunsmore
(1975, p. 182) is given in Lwin and Maritz (1980). These data are
shown in Table 8.10. In every $(x_i, y_i)$ pair the x value is an accurate
laboratory determination and the y value is an ‘on site’ measurement.
The form of $m(x, \theta)$ was taken to be $\alpha + \beta x$, and least squares
estimates of $\alpha$ and $\beta$ were used. A normal probability plot of residuals
suggested that F could be taken as the standard normal distribution
with p.d.f. $\phi(u)$. Thus the non-linear estimate becomes

$$\hat x(y) = \sum_{i=1}^{n} x_i\,\phi\{(y - \hat\alpha - \hat\beta x_i)/\hat\sigma\} \Big/ \sum_{i=1}^{n} \phi\{(y - \hat\alpha - \hat\beta x_i)/\hat\sigma\},$$

where $\hat\alpha$ and $\hat\beta$ are the least squares estimates of $\alpha$ and $\beta$.
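The three predictors can be compared on the soil data. The following sketch fits the line once to all 16 pairs, with sample moments taken with divisor n, and evaluates the predictors at a hypothetical new reading y = 20.0; it is not the leave-one-out calculation used for Table 8.10, so its output is not meant to match the table.

```python
from math import exp

# (x_i, y_i) pairs as read from Table 8.10.
x = [35.3, 27.6, 36.2, 21.6, 39.8, 24.1, 16.1, 27.5,
     33.1, 12.8, 23.1, 19.6, 26.1, 19.3, 18.8, 39.8]
y = [23.7, 20.2, 24.5, 15.8, 29.2, 17.8, 10.1, 19.0,
     24.3, 10.6, 15.2, 11.4, 19.7, 12.7, 12.6, 31.8]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((a - xbar) ** 2 for a in x) / n
syy = sum((b - ybar) ** 2 for b in y) / n
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / n

beta = sxy / sxx                 # least squares slope of y on x
alpha = ybar - beta * xbar       # least squares intercept
sigma2 = syy - beta ** 2 * sxx   # residual variance (divisor n)

def classical(y0):
    return (y0 - alpha) / beta

def inverse(y0):
    return xbar + (sxy / syy) * (y0 - ybar)

def plug_in_bayes(y0):
    """Normal-prior Bayes estimate with all parameters replaced by estimates."""
    return ((beta ** 2 * sxx * (y0 - alpha) / beta + sigma2 * xbar)
            / (beta ** 2 * sxx + sigma2))

def nonlinear(y0):
    """Weighted mean of the observed x_i with normal-density weights."""
    w = [exp(-0.5 * (y0 - alpha - beta * xi) ** 2 / sigma2) for xi in x]
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

y0 = 20.0  # hypothetical on site reading
print(classical(y0), inverse(y0), plug_in_bayes(y0), nonlinear(y0))
```

The inverse estimator and the plug-in Bayes estimate coincide, as the algebra leading to (8.4.1) requires. The non-linear estimate, being a weighted mean of the observed $x_i$, can never fall outside their range, which is one way of seeing why it behaves differently from the two linear predictors for extreme y.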


Table 8.10 Laboratory measurement x_i and on site measurement
y_i of water content of soil samples, and three
predictors of x_i: (1) classical (2) inverse (3) non-linear

  i     x_i    y_i     (1)    (2)    (3)

  1    35.3   23.7    32.5   32.2   33.2
  2    27.6   20.2    28.3   28.2   27.0
  3    36.2   24.5    33.5   33.2   33.9
  4    21.6   15.8    22.7   22.8   22.9
  5    39.8   29.2    40.0   39.2   39.2
  6    24.1   17.8    25.3   25.3   25.7
  7    16.1   10.1    15.0   15.6   15.0
  8    27.5   19.0    26.7   26.7   26.1
  9    33.1   24.3    33.7   33.3   35.6
 10    12.8   10.6    16.6   17.0   17.4
 11    23.1   15.2    21.7   21.9   21.1
 12    19.6   11.4    16.5   16.9   17.1
 13    26.1   19.7    27.7   27.7   27.3
 14    19.3   12.7    18.5   18.9   18.9
 15    18.8   12.6    18.4   18.8   19.1
 16    39.8   31.8    44.7   43.9   39.8

 m.s.e.                4.6    4.3    3.6
The estimated, or predicted, x values in Table 8.10 were computed
by treating every $x_i$, i = 1, ..., 16, as the unknown x, and the
remaining 15 observations as the previous results from the calibration
experiment. The last line in the table gives the mean squared error
calculated as $(1/16)\sum (x_i - \text{estimated } x_i)^2$.

8.4.2 Two-way contingency tables 

Let $X_{ij}$ be the frequency in row i column j of a two-way contingency
table, i = 1, 2, ..., n, j = 1, 2, ..., p. There are different ways in which
Bayesian ideas can be used in the analysis of such data. We begin with
an account of an approach described by Laird (1978), who refers to
several others. Assume that the $X_{ij}$ are multinomially distributed
with $E(X_{ij}) = N\pi_{ij}$, where N is the total number of observations.
Let

$$\ln \pi_{ij} = u_0 + u_{1i} + u_{2j} + u_{12ij},$$

and let $\mathbf{u}_1 = (u_{11}, u_{12}, \ldots, u_{1,n-1})$, $\mathbf{u}_2 = (u_{21}, u_{22}, \ldots, u_{2,p-1})$ while $\mathbf{u}_{12}$ is
the vector of all $u_{12ij}$. Also let $\mathbf{X}$ be the vector of the cell frequencies
and $\mathbf{u}^T = (\mathbf{u}_1^T, \mathbf{u}_2^T, \mathbf{u}_{12}^T)$. Laird assumes that the $u_{1i}$ and the $u_{2j}$ are a
priori independent with ‘flat’ distributions while the $u_{12ij}$ are a priori
independently and identically distributed with common $N(0, \sigma^2)$. The
posterior density of $\mathbf{u}$ is then

$$f(\mathbf{u} \mid \mathbf{x}, \sigma^2) = \{m(\mathbf{x}, \sigma^2)\}^{-1} f(\mathbf{x} \mid \mathbf{u}) f(\mathbf{u}_{12} \mid \sigma^2),$$

where $f(\mathbf{x} \mid \mathbf{u})$ is the usual likelihood of the observed frequencies $\mathbf{x}$,
$f(\mathbf{u}_{12} \mid \sigma^2)$ is the multivariate normal prior density of $\mathbf{u}_{12}$, and

$$m(\mathbf{x}, \sigma^2) = \int f(\mathbf{x} \mid \mathbf{u}) f(\mathbf{u}_{12} \mid \sigma^2)\,d\mathbf{u}.$$

We may interpret $m(\mathbf{x}, \sigma^2)$ as the marginal likelihood of $\mathbf{x}$ given $\sigma^2$.
For computational reasons it is suggested that the posterior mode, $\mathbf{u}^*$,
be taken as point estimate of $\mathbf{u}$. A Bayes point estimate of $\pi_{ij}$ is then

$$\pi^*_{ij} = \exp(u^*_{1i} + u^*_{2j} + u^*_{12ij}) \Big/ \sum_{k,l} \exp(u^*_{1k} + u^*_{2l} + u^*_{12kl}).$$

This Bayes estimate depends on $\sigma^2$. Moreover, in the model as
formulated estimation of $\sigma^2$ is possible, one way of doing it being by
using the marginal likelihood $m(\mathbf{x}, \sigma^2)$. Laird (1978) gives several
methods of estimation of $\sigma^2$, all of them based on the idea of




Table 8.11 Numbers of deaths according to occupation (row) and
cause (column)

                              Cause of death
Row      1     2     3     4     5     6     7     8     9    10    11

  1     10    78     7    74    14    28   111    49    46     7    38
  2    281   733    70  1129   234   332  1185   783   500   266   608
  3     58   267    28   240    39    69   294   259   134    49   295
  4      3    55     2    50     7    18    59    39    17    17    13
  5      5    54     2    37     6     8    44    28    15     6     7
  6      1    34     3    38     4     9    26    23    13     2     1
  7      7    60     0    47     3    11    43    39    14    12    15
  8     16    71     2    63     7    15    78    56    38    11    17
  9      3    15     0     5     0     2    18    14     4     1     4
 10     40   183     7   185    31    73   249   136    77    70    91
 11      7    32     2    32     4    18    60    29    24    16    10
 12      9    54     2    56     6    25    66    37    23    13    16
 13      9   101     4    70    13    34    82    42    49    19    18
 14      8    71     0    41     1    20    54    34    28    15    10
 15     13   116     5    87    10    25   115    88    52    23    14
 16      3    30     0     6     0     1     4    22     2     2     3


maximizing m(\,a 2 ). Replacing a 2 by the estimate d 2 in the formula 
above gives an empirical version of the Bayes estimate. 

Laird gives an application to a data set concerning male deaths 
classified according to occupation and cause of death; these data are 
reported in Good (1956). Table 8.11 gives a subset of the original 
collection of data. Numbers of deaths in 16 occupations (rows) and 11 
causes (columns) are shown. 

Laird reports an estimate $\hat\sigma^2 = 0.078$ from these data and gives the
empirical Bayes expected frequencies for a selection of cells in 
Table 8.12 under the heading EB estimate. The cells chosen are those 
in which the observed frequencies are 0 and 3. 
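
The role of $\sigma^2$ in the posterior mode described above can be made explicit by writing out the stationarity conditions. The lines below are a sketch and not Laird's own derivation: they assume that every $u_{12ij}$ is a free parameter with the $N(0, \sigma^2)$ prior and that the main effects have flat priors. The symbols $x_{i+}$, $x_{+j}$, $\pi^*_{i+}$ and $\pi^*_{+j}$ (row and column sums) are introduced here for convenience.

```latex
\begin{align*}
\ln f(\mathbf{u} \mid \mathbf{x}, \sigma^2)
  &= \sum_{i,j} x_{ij} \ln \pi_{ij}
     - \frac{1}{2\sigma^2} \sum_{i,j} u_{12ij}^2 + \text{constant}, \\
\text{main effects:}\quad
  & N \pi^*_{i+} = x_{i+}, \qquad N \pi^*_{+j} = x_{+j}, \\
\text{interactions:}\quad
  & u^*_{12ij} = \sigma^2 \, (x_{ij} - N \pi^*_{ij}).
\end{align*}
```

Under these assumptions, letting $\sigma^2 \to 0$ forces $u^*_{12ij} \to 0$ and gives the independence fit $N\pi^*_{ij} = x_{i+} x_{+j}/N$, while letting $\sigma^2 \to \infty$ returns the observed frequencies $x_{ij}$. Intermediate values of $\sigma^2$ therefore pull the observed table towards the independence fit.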

In many instances a more natural Bayes approach would be
indicated by the notion that the rows in the contingency table are
randomly generated so that the $k$ row probabilities may be supposed
to have a $k$-dimensional joint distribution. In the notation of
section 4.5 the row probabilities are $\theta_i$, $i = 1, \ldots, p$, and the $n$ rows are
thought of as a random sample of $p$-dimensional vectors. A tractable
model for the prior distribution of $(\theta_1, \theta_2, \ldots, \theta_p)$ is the Dirichlet
distribution with parameters $\alpha_1, \alpha_2, \ldots, \alpha_p$. This leads to a different
empirical Bayes approach of which details are given in section 4.5. In
essence it entails estimation of the parameters $\alpha_i$, $i = 1, \ldots, p$, using the
$n$ observed row vectors.

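Recall from section 4.5 that the method of moments gives the
following equations, (4.5.12), for estimating the $\alpha_j$:

$$(1/n) \sum_{i=1}^{n} X_{ij}/N_i = \alpha_j/A = \beta_j \qquad (8.4.3)$$

$$(1/n) \sum_{j=1}^{p} \sum_{i=1}^{n} X_{ij}(X_{ij} - 1)/\{N_i(N_i - 1)\} = \Big\{\sum_{j=1}^{p} \alpha_j^2 + A\Big\} \Big/ \{A(A + 1)\}. \qquad (8.4.4)$$

Here $N_i$ denotes the total of row $i$ and $A = \sum_j \alpha_j$.
These equations can be solved as indicated in section 4.5 giving the
result (4.5.8).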
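
The moment equations can be solved directly, as the following Python sketch illustrates. It assumes that $N_i$ is the total of row $i$, that the left side of (8.4.4) is the row average of $\sum_j X_{ij}(X_{ij}-1)/\{N_i(N_i-1)\}$, and that $A$ is obtained by substituting $\alpha_j = A\beta_j$ into (8.4.4); whether this coincides in form with result (4.5.8) is not checked here. The function name and the data layout are illustrative choices, and the data are the entries of Table 8.11.

```python
# Table 8.11: 16 occupations (rows) by 11 causes of death (columns).
TABLE_8_11 = [
    [10, 78, 7, 74, 14, 28, 111, 49, 46, 7, 38],
    [281, 733, 70, 1129, 234, 332, 1185, 783, 500, 266, 608],
    [58, 267, 28, 240, 39, 69, 294, 259, 134, 49, 295],
    [3, 55, 2, 50, 7, 18, 59, 39, 17, 17, 13],
    [5, 54, 2, 37, 6, 8, 44, 28, 15, 6, 7],
    [1, 34, 3, 38, 4, 9, 26, 23, 13, 2, 1],
    [7, 60, 0, 47, 3, 11, 43, 39, 14, 12, 15],
    [16, 71, 2, 63, 7, 15, 78, 56, 38, 11, 17],
    [3, 15, 0, 5, 0, 2, 18, 14, 4, 1, 4],
    [40, 183, 7, 185, 31, 73, 249, 136, 77, 70, 91],
    [7, 32, 2, 32, 4, 18, 60, 29, 24, 16, 10],
    [9, 54, 2, 56, 6, 25, 66, 37, 23, 13, 16],
    [9, 101, 4, 70, 13, 34, 82, 42, 49, 19, 18],
    [8, 71, 0, 41, 1, 20, 54, 34, 28, 15, 10],
    [13, 116, 5, 87, 10, 25, 115, 88, 52, 23, 14],
    [3, 30, 0, 6, 0, 1, 4, 22, 2, 2, 3],
]


def dirichlet_moment_estimates(table):
    """Return (beta, A, alpha) from the moment equations (8.4.3)-(8.4.4)."""
    n = len(table)
    p = len(table[0])
    totals = [sum(row) for row in table]

    # (8.4.3): beta_j is the average over rows of the proportions X_ij / N_i.
    beta = [
        sum(row[j] / n_i for row, n_i in zip(table, totals)) / n
        for j in range(p)
    ]

    # Left side of (8.4.4): average over rows of sum_j X(X - 1) / {N(N - 1)}.
    m2 = sum(
        sum(x * (x - 1) for x in row) / (n_i * (n_i - 1))
        for row, n_i in zip(table, totals)
    ) / n

    # With alpha_j = A * beta_j, (8.4.4) reads (A * sum(beta^2) + 1) / (A + 1) = m2,
    # which is linear in A.
    sum_beta_sq = sum(b * b for b in beta)
    a = (1.0 - m2) / (m2 - sum_beta_sq)
    alpha = [a * b for b in beta]
    return beta, a, alpha


if __name__ == "__main__":
    beta_hat, a_hat, alpha_hat = dirichlet_moment_estimates(TABLE_8_11)
    print([round(b, 4) for b in beta_hat], round(a_hat, 2))
```

The denominator `m2 - sum_beta_sq` is a small difference of two similar quantities, so the estimate of $A$ is sensitive to small changes in the data; this is one practical reason for caution when $p$ is large relative to $n$.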

Applying this method to the data in Table 8.11 is not entirely
satisfactory because we have the rather large number of parameters,
$p = 11$, and only $n = 16$ observed vectors. Nevertheless, the estimates
are as follows: $\hat\beta_1 = 0.0291$, $\hat\beta_2 = 0.2092$, $\hat\beta_3 = 0.0077$, $\hat\beta_4 = 0.1588$,
$\hat\beta_5 = 0.0198$, $\hat\beta_6 = 0.0537$, $\hat\beta_7 = 0.1984$, $\hat\beta_8 = 0.1477$, $\hat\beta_9 = 0.0794$,
$\hat\beta_{10} = 0.0387$, $\hat\beta_{11} = 0.0576$, $\hat A = 90.30$.

The resulting EB estimates for the cells selected in Table 8.12 are 
shown under the heading (2). 
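
Under a Dirichlet prior the posterior mean of the probability for cell $(i, j)$ is $(X_{ij} + \alpha_j)/(N_i + A)$, so a natural EB expected frequency is $N_i(X_{ij} + \hat\alpha_j)/(N_i + \hat A)$ with $\hat\alpha_j = \hat A \hat\beta_j$. The Python sketch below assumes that this is how the entries under heading (2) are obtained; that is an inference from the quoted estimates rather than a statement in the text. The function name and the dictionary of row totals are illustrative.

```python
# Estimates quoted in the text for the unpooled fit (p = 11 causes).
BETA_HAT = [0.0291, 0.2092, 0.0077, 0.1588, 0.0198, 0.0537,
            0.1984, 0.1477, 0.0794, 0.0387, 0.0576]
A_HAT = 90.30

# Row totals N_i of Table 8.11 for the occupations used below,
# obtained by summing the corresponding rows of the table.
ROW_TOTALS = {4: 280, 6: 154, 9: 66, 16: 73}


def eb_expected_frequency(x_ij, i, j, beta=BETA_HAT, a=A_HAT, totals=ROW_TOTALS):
    """EB expected frequency N_i (x_ij + alpha_j) / (N_i + A) for cell (i, j).

    The indices i and j are 1-based, as in Tables 8.11 and 8.12.
    """
    n_i = totals[i]
    alpha_j = a * beta[j - 1]
    return n_i * (x_ij + alpha_j) / (n_i + a)


if __name__ == "__main__":
    # (row, column, observed frequency) for a few of the selected cells.
    for i, j, x in [(9, 3, 0), (16, 3, 0), (4, 1, 3), (6, 3, 3)]:
        print(i, j, x, round(eb_expected_frequency(x, i, j), 2))
```

Because $\hat\alpha_j$ is positive, empty cells receive a small positive estimate, and the amount of smoothing is governed by the size of $\hat A$ relative to the row total $N_i$.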


Table 8.12 Estimated cell frequencies for certain i and j in Table 8.11
when $\hat\sigma^2 = 0.078$

  i    j   Observed frequency   EB estimate    (2)     (3)
  7    3           0               1.88        0.52    0.37
  9    3           0               0.55        0.29    0.24
  9    5           0               1.29        0.75     -
 14    3           0               2.09        0.53    0.37
 16    3           0               0.57        0.31    0.25
 16    5           0               1.33        0.80     -
  4    1           3               6.25        4.26    3.88
  6    3           3               1.48        2.33    2.51
  7    5           3               4.73        3.52     -
  9    1           3               1.99        2.38    2.50
 16    1           3               2.05        2.52    2.62
 16   11           3               3.61        3.67     -

The number of parameters to be estimated can be reduced by
pooling cells in Table 8.11. For example, if we pool causes 4-11 we
get $k = 4$ parameters estimated as follows: $\hat\beta_1 = 0.0291$, $\hat\beta_2 = 0.2092$,
$\hat\beta_3 = 0.0077$, $\hat A = 57.44$.

The resulting EB estimates for the cells from columns 1-3 selected 
in Table 8.12 are shown under the heading (3). 

8.4.3 Stratified sampling 

Stratified sampling of populations is performed for various reasons, 
including convenience, the desire to improve precision, and also 
because results for individual strata may be of interest. In the latter 
event empirical Bayes ideas may be useful for smoothing out 
irregularities in individual estimates, especially if they are based on 
relatively sparse data. The justification for such an approach can be 
found in the notion that the stratum parameters are randomly 
assigned, if this seems appropriate. Alternatively one may appeal to 
an empirical Bayes justification of a compound decision approach. 

Typically we have in stratum $i$ an estimate $y_i$ of a population
characteristic subject to a variance $\sigma_i^2$. This $\sigma_i^2$ will depend on the
stratum sample size and on the size of the stratum, among other
things. We shall assume that it is known, or that a reasonably good
estimate of it is available. If the $\lambda_i$ are taken to be randomly generated
by a $N(\mu_G, \sigma_G^2)$ distribution, and if the distribution of $y_i \mid \lambda_i$ is $N(\lambda_i, \sigma_i^2)$,
then the methods of estimating $\mu_G$ and $\sigma_G^2$ are exactly like those
described in section 8.2.1. When concomitant variables $x_i$ are observed,
incorporating such information in an EB approach is possible
as described in section 8.1.
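
The kind of smoothing involved can be illustrated with a short Python sketch for the normal model without concomitant variables. It uses simple moment estimators (the unweighted mean of the stratum estimates for the prior mean, and their sample variance minus the average sampling variance, truncated at zero, for the prior variance); these are assumptions made for illustration and may differ in detail from the methods of section 8.2.1. The function name is hypothetical, and the sampling variances are taken to be known and positive.

```python
def eb_stratum_estimates(y, var):
    """Shrink stratum estimates y_i towards their mean.

    y   : list of stratum estimates y_i
    var : list of known sampling variances sigma_i^2 (all positive)
    Returns (mu_hat, sigma_g2_hat, shrunken estimates).
    """
    n = len(y)
    mu_hat = sum(y) / n

    # The sample variance of the y_i estimates sigma_G^2 + average(sigma_i^2).
    s2 = sum((yi - mu_hat) ** 2 for yi in y) / (n - 1)
    sigma_g2_hat = max(0.0, s2 - sum(var) / n)

    estimates = []
    for yi, vi in zip(y, var):
        # Weight on the stratum's own estimate; noisier strata get less weight.
        w = sigma_g2_hat / (sigma_g2_hat + vi)
        estimates.append(mu_hat + w * (yi - mu_hat))
    return mu_hat, sigma_g2_hat, estimates


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any study in the text.
    print(eb_stratum_estimates([10.0, 12.0, 14.0, 16.0], [4.0, 1.0, 1.0, 0.25]))
```

Strata with large sampling variance are pulled more strongly towards the overall mean, which is the smoothing of sparse strata referred to above.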

A good example of this type is described by Fay and Herriot (1979). 
It has to do with estimating per capita income (PCI) in a large 
number, approximately 39 000, of local government units. Many of 
these were places with populations smaller than 500 persons. Original 
estimates of PCI for 1970 were made on the basis of a 20% sample of 
the population so that the sampling errors for the small places can be 
substantial. Certain concomitant information was used, such as 1969 
tax-return data and data on housing from a 1970 census. A further 
point of considerable interest in the Fay and Herriot paper is that 
special censuses of a selection of the small places conducted in 1973 
made it possible to compare various 1972 PCI estimates with the 
supposedly true values; the authors refer to some problems associated 
with the special census values. The EB estimates, actually referred to
as James-Stein estimates by Fay and Herriot, exhibit superior 
performance relative to two other, more common, estimates. 

8.4.4 Empirical Bayes estimation of rates in longitudinal studies 

In longitudinal studies aimed at estimating rates of change it is not 
uncommon for observations over relatively short time intervals to be 
made on many subjects. A typical study of this sort is the assessment 
of rate of change in lung function of workers in a certain industry. 
Over the period of the study workers are tested from time to time and 
an attempt is made to estimate the rate of change for each worker. 
Suppose that on subject $i$ responses $y_{ij}$ are measured at times $t_{ij}$,
$j = 1, 2, \ldots, m_i$. Then a reasonable estimate of the subject’s rate of
change is the regression coefficient $b_i$ obtained by fitting a least
squares line to the $(y, t)$ values of the individual subject. If the total
time period over which the observations are taken is relatively short
the subject’s true response curve can be assumed well approximated
by a straight line over the observed time interval. The slope $b_i$ can be
regarded as an estimate of the subject’s rate of change at time $\bar t_i$, the
mean of the $t_{ij}$ values.

Hui and Berger (1983) describe a study of this sort, its details being
briefly as follows. As above, the observed least squares slope of subject
$i$ is taken as an estimate of a true slope at a point somewhere within
the observed time range. Instead of taking this point as $\bar t_i$, however,
Hui and Berger adjust it to $\xi_i = \bar t_i + (1/2) M_{3i}/M_{2i}$, where $M_{3i}$ and $M_{2i}$
are the third and second sample moments of the $t$-values of subject $i$.
Thus $b_i$ is an estimate of $\beta_i(\xi_i)$ with variance $v_i = \sigma_i^2 / \sum_{j=1}^{m_i} (t_{ij} - \bar t_i)^2$.
The prior distributional assumptions about the slopes $\beta_i(\xi_i)$ are that
they are generated by a $N(\gamma_0 + \gamma_1 \xi_i, \tau^2)$ distribution. The specification
of the model so far is exactly along the lines of the discussion of
section 8.1, $\xi_i$ here being the concomitant variable. A novel aspect of
the treatment by Hui and Berger is that the $v_i$ values are not assumed
known, or at least relatively accurately estimated. Instead, let $s_i^2$ be the
usual unbiased estimate of $\sigma_i^2$ obtained from the fitting of a straight
line by least squares to the data of subject $i$. Assume that $(m_i - 2)s_i^2$ is
distributed like $\sigma_i^2 \chi^2(m_i - 2)$, and that the $\sigma_i^2$ values are generated
by a prior distribution of the inverse gamma form with density

$$\pi(\sigma^2) \propto \sigma^{-2(A+1)} \exp\{-B/(2\sigma^2)\},$$

where $A$ and $B$ are parameters which can be estimated from the data.
These estimates are incorporated to give a somewhat more comprehensive
EB method.
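
The per-subject quantities entering this model can be computed as in the Python sketch below. It assumes that $M_{2i}$ and $M_{3i}$ are central sample moments of the $t$-values (the divisor cancels in the ratio), and the function name is hypothetical. One way to see why an adjustment of this form is natural: if the true response curve is quadratic in $t$, the least squares slope equals the derivative of the curve at $\bar t_i + (1/2) M_{3i}/M_{2i}$, which the example checks numerically for $y = t^2$.

```python
def slope_summary(t, y):
    """Least squares slope and related quantities for one subject.

    Returns (b, xi, s2, v_hat):
      b     : least squares slope
      xi    : adjusted time point t_bar + 0.5 * M3 / M2 (central moments assumed)
      s2    : residual mean square with m - 2 degrees of freedom
      v_hat : estimated variance of b, s2 / sum((t - t_bar)^2)
    """
    m = len(t)
    t_bar = sum(t) / m
    y_bar = sum(y) / m
    d = [ti - t_bar for ti in t]

    sxx = sum(di * di for di in d)
    b = sum(di * (yi - y_bar) for di, yi in zip(d, y)) / sxx
    intercept = y_bar - b * t_bar

    rss = sum((yi - intercept - b * ti) ** 2 for ti, yi in zip(t, y))
    s2 = rss / (m - 2)
    v_hat = s2 / sxx

    m2 = sxx / m
    m3 = sum(di ** 3 for di in d) / m
    xi = t_bar + 0.5 * m3 / m2
    return b, xi, s2, v_hat


if __name__ == "__main__":
    times = [0.0, 1.0, 2.0, 4.0]          # illustrative, unequally spaced
    b, xi, s2, v_hat = slope_summary(times, [ti ** 2 for ti in times])
    # For y = t^2 the derivative at xi is 2 * xi, which matches the fitted slope.
    print(b, 2 * xi)
```

When the observation times are symmetric about their mean, $M_{3i} = 0$ and the adjusted point reduces to $\bar t_i$.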

An application of the approach described above to data on bone 
loss in postmenopausal women is given by Hui and Berger. 